Search Results: "Vincent Bernat"

25 May 2021

Vincent Bernat: Jerikan+Ansible: a configuration management system for network

There are many resources for network automation with Ansible. Most of them only expose the first steps or limit themselves to a narrow scope. They give no clue on how to expand from that. Real network environments may be large, versatile, heterogeneous, and filled with exceptions. The lack of real-world examples for Ansible deployments, unlike Puppet and SaltStack, leads many teams to build brittle and incomplete automation solutions. We have released under an open-source license our attempt to tackle this problem: Here is a quick demo to configure a new peering:
This work is the collective effort of C dric Hasco t, Jean-Christophe Legatte, Lo c Pailhas, S bastien Hurtel, Tchadel Icard, and Vincent Bernat. We are the network team of Blade, a French company operating Shadow, a cloud-computing product. In May 2021, our company was bought by Octave Klaba and the infrastructure is being transferred to OVHcloud, saving Shadow as a product, but making our team redundant. Our network was around 800 devices, spanning over 10 datacenters with more than 2.5 Tbps of available egress bandwidth. The released material is therefore a substantial example of managing a medium-scale network using Ansible. We have left out the handling of our legacy datacenters to make the final result more readable while keeping enough material to not turn it into a trivial example.

Jerikan The first component is Jerikan. As input, it takes a list of devices, configuration data, templates, and validation scripts. It generates a set of configuration files for each device. Ansible could cover this task, but it has the following limitations:
  • it is slow;
  • errors are difficult to debug;1 and
  • the hierarchy to look up a variable is rigid.
Jerikan inputs and outputs
Jerikan inputs and outputs
If you want to follow the examples, you only need to have Docker and Docker Compose installed. Clone the repository and you are ready!

Source of truth We use YAML files, versioned with Git, as the single source of truth instead of using a database, like NetBox, or a mix of a database and text files. This provides many advantages:
  • anyone can use their preferred text editor;
  • the team prepares changes in branches;
  • the team reviews changes using merge requests;
  • the merge requests expose the changes to the generated configuration files;
  • rollback to a previous state is easy; and
  • it is fast.
The first file is devices.yaml. It contains the device list. The second file is classifier.yaml. It defines a scope for each device. A scope is a set of keys and values. It is used in templates and to look up data associated with a device.
$ ./run-jerikan scope to1-p1.sk1.blade-group.net
continent: apac
environment: prod
groups:
- tor
- tor-bgp
- tor-bgp-compute
host: to1-p1.sk1
location: sk1
member: '1'
model: dell-s4048
os: cumulus
pod: '1'
shorthost: to1-p1
The device name is matched against a list of regular expressions and the scope is extended by the result of each match. For to1-p1.sk1.blade-group.net, the following subset of classifier.yaml defines its scope:
matchers:
  - '^(([^.]*)\..*)\.blade-group\.net':
      environment: prod
      host: '\1'
      shorthost: '\2'
  - '\.(sk1)\.':
      location: '\1'
      continent: apac
  - '^to([12])-[as]?p(\d+)\.':
      member: '\1'
      pod: '\2'
  - '^to[12]-p\d+\.':
      groups:
        - tor
        - tor-bgp
        - tor-bgp-compute
  - '^to[12]-(p ap)\d+\.sk1\.':
      os: cumulus
      model: dell-s4048
The third file is searchpaths.py. It describes which directories to search for a variable. A Python function provides a list of paths to look up in data/ for a given scope. Here is a simplified version:2
def searchpaths(scope):
    paths = [
        "host/ scope[location] / scope[shorthost] ",
        "location/ scope[location] ",
        "os/ scope[os] - scope[model] ",
        "os/ scope[os] ",
        'common'
    ]
    for idx in range(len(paths)):
        try:
            paths[idx] = paths[idx].format(scope=scope)
        except KeyError:
            paths[idx] = None
    return [path for path in paths if path]
With this definition, the data for to1-p1.sk1.blade-group.net is looked up in the following paths:
$ ./run-jerikan scope to1-p1.sk1.blade-group.net
[ ]
Search paths:
  host/sk1/to1-p1
  location/sk1
  os/cumulus-dell-s4048
  os/cumulus
  common
Variables are scoped using a namespace that should be specified when doing a lookup. We use the following ones:
  • system for accounts, DNS, syslog servers,
  • topology for ports, interfaces, IP addresses, subnets,
  • bgp for BGP configuration
  • build for templates and validation scripts
  • apps for application variables
When looking up for a variable in a given namespace, Jerikan looks for a YAML file named after the namespace in each directory in the search paths. For example, if we look up a variable for to1-p1.sk1.blade-group.net in the bgp namespace, the following YAML files are processed: host/sk1/to1-p1/bgp.yaml, location/sk1/bgp.yaml, os/cumulus-dell-s4048/bgp.yaml, os/cumulus/bgp.yaml, and common/bgp.yaml. The search stops at the first match. The schema.yaml file allows us to override this behavior by asking to merge dictionaries and arrays across all matching files. Here is an excerpt of this file for the topology namespace:
system:
  users:
    merge: hash
  sampling:
    merge: hash
  ansible-vars:
    merge: hash
  netbox:
    merge: hash
The last feature of the source of truth is the ability to use Jinja2 templates for keys and values by prefixing them with ~ :
# In data/os/junos/system.yaml
netbox:
  manufacturer: Juniper
  model: "~  model upper  "
# In data/groups/tor-bgp-compute/system.yaml
netbox:
  role: net_tor_gpu_switch
Looking up for netbox in the system namespace for to1-p2.ussfo03.blade-group.net yields the following result:
$ ./run-jerikan scope to1-p2.ussfo03.blade-group.net
continent: us
environment: prod
groups:
- tor
- tor-bgp
- tor-bgp-compute
host: to1-p2.ussfo03
location: ussfo03
member: '1'
model: qfx5110-48s
os: junos
pod: '2'
shorthost: to1-p2
[ ]
Search paths:
[ ]
  groups/tor-bgp-compute
[ ]
  os/junos
  common
$ ./run-jerikan lookup to1-p2.ussfo03.blade-group.net system netbox
manufacturer: Juniper
model: QFX5110-48S
role: net_tor_gpu_switch
This also works for structured data:
# In groups/adm-gateway/topology.yaml
interface-rescue:
  address: "~  lookup('topology', 'addresses').rescue  "
  up:
    - "~ip route add default via   lookup('topology', 'addresses').rescue ipaddr('first_usable')   table rescue"
    - "~ip rule add from   lookup('topology', 'addresses').rescue ipaddr('address')   table rescue priority 10"
# In groups/adm-gateway-sk1/topology.yaml
interfaces:
  ens1f0: "~  lookup('topology', 'interface-rescue')  "
This yields the following result:
$ ./run-jerikan lookup gateway1.sk1.blade-group.net topology interfaces
[ ]
ens1f0:
  address: 121.78.242.10/29
  up:
  - ip route add default via 121.78.242.9 table rescue
  - ip rule add from 121.78.242.10 table rescue priority 10
When putting data in the source of truth, we use the following rules:
  1. Don t repeat yourself.
  2. Put the data in the most specific place without breaking the first rule.
  3. Use templates with parsimony, mostly to help with the previous rules.
  4. Restrict the data model to what is needed for your use case.
The first rule is important. For example, when specifying IP addresses for a point-to-point link, only specify one side and deduce the other value in the templates. The last rule means you do not need to mimic a BGP YANG model to specify BGP peers and policies:
peers:
  transit:
    cogent:
      asn: 174
      remote:
        - 38.140.30.233
        - 2001:550:2:B::1F9:1
      specific-import:
        - name: ATT-US
          as-path: ".*7018$"
          lp-delta: 50
  ix-sfmix:
    rs-sfmix:
      monitored: true
      asn: 63055
      remote:
        - 206.197.187.253
        - 206.197.187.254
        - 2001:504:30::ba06:3055:1
        - 2001:504:30::ba06:3055:2
    blizzard:
      asn: 57976
      remote:
        - 206.197.187.42
        - 2001:504:30::ba05:7976:1
      irr: AS-BLIZZARD

Templates The list of templates to compile for each device is stored in the source of truth, under the build namespace:
$ ./run-jerikan lookup edge1.ussfo03.blade-group.net build templates
data.yaml: data.j2
config.txt: junos/main.j2
config-base.txt: junos/base.j2
config-irr.txt: junos/irr.j2
$ ./run-jerikan lookup to1-p1.ussfo03.blade-group.net build templates
data.yaml: data.j2
config.txt: cumulus/main.j2
frr.conf: cumulus/frr.j2
interfaces.conf: cumulus/interfaces.j2
ports.conf: cumulus/ports.j2
dhcpd.conf: cumulus/dhcp.j2
default-isc-dhcp: cumulus/default-isc-dhcp.j2
authorized_keys: cumulus/authorized-keys.j2
motd: linux/motd.j2
acl.rules: cumulus/acl.j2
rsyslog.conf: cumulus/rsyslog.conf.j2
Templates are using Jinja2. This is the same engine used in Ansible. Jerikan ships some custom filters but also reuse some of the useful filters from Ansible, notably ipaddr. Here is an excerpt of templates/junos/base.j2 to configure DNS and NTP servers on Juniper devices:
system  
  ntp  
 % for ntp in lookup("system", "ntp") % 
    server   ntp  ;
 % endfor % 
   
  name-server  
 % for dns in lookup("system", "dns") % 
      dns  ;
 % endfor % 
   
 
The equivalent template for Cisco IOS-XR is:
 % for dns in lookup('system', 'dns') % 
domain vrf VRF-MANAGEMENT name-server   dns  
 % endfor % 
!
 % for syslog in lookup('system', 'syslog') % 
logging   syslog   vrf VRF-MANAGEMENT
 % endfor % 
!
There are three helper functions provided:
  • devices() returns the list of devices matching a set of conditions on the scope. For example, devices("location==ussfo03", "groups==tor-bgp") returns the list of devices in San Francisco in the tor-bgp group. You can also omit the operator if you want the specified value to be equal to the one in the local scope. For example, devices("location") returns devices in the current location.
  • lookup() does a key lookup. It takes the namespace, the key, and optionally, a device name. If not provided, the current device is assumed.
  • scope() returns the scope of the provided device.
Here is how you would define iBGP sessions between edge devices in the same location:
 % for neighbor in devices("location", "groups==edge") if neighbor != device % 
   % for address in lookup("topology", "addresses", neighbor).loopback tolist % 
protocols bgp group IPV  address ipv  -EDGES-IBGP  
  neighbor   address    
    description "IPv  address ipv  : iBGP to   neighbor  ";
   
 
   % endfor % 
 % endfor % 
We also have a global key-value store to save information to be reused in another template or device. This is quite useful to automatically build DNS records. First, capture the IP address inserted into a template with store() as a filter:
interface Loopback0
 description 'Loopback:'
  % for address in lookup('topology', 'addresses').loopback tolist % 
 ipv  address ipv   address   address store('addresses', 'Loopback0') ipaddr('cidr')  
  % endfor % 
!
Then, reuse it later to build DNS records by iterating over store():4
 % for device, ip, interface in store('addresses') % 
   % set interface = interface replace('/', '-') replace('.', '-') replace(':', '-') % 
   % set name = ' . '.format(interface lower, device) % 
  name  . IN   'A' if ip ipv4 else 'AAAA'     ip ipaddr('address')  
 % endfor % 
Templates are compiled locally with ./run-jerikan build. The --limit argument restricts the devices to generate configuration files for. Build is not done in parallel because a template may depend on the data collected by another template. Currently, it takes 1 minute to compile around 3000 files spanning over 800 devices.
Jerikan outputs when building templates
Output of Jerikan after building configuration files for six devices
When an error occurs, a detailed traceback is displayed, including the template name, the line number and the value of all visible variables. This is a major time-saver compared to Ansible!
templates/opengear/config.j2:15: in top-level template code
    config.interfaces.  interface  .netmask   adddress   ipaddr("netmask")  
        continent  = 'us'
        device     = 'con1-ag2.ussfo03.blade-group.net'
        environment = 'prod'
        host       = 'con1-ag2.ussfo03'
        infos      =  'address': '172.30.24.19/21' 
        interface  = 'wan'
        location   = 'ussfo03'
        loop       = <LoopContext 1/2>
        member     = '2'
        model      = 'cm7132-2-dac'
        os         = 'opengear'
        shorthost  = 'con1-ag2'
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
value = JerkianUndefined, query = 'netmask', version = False, alias = 'ipaddr'
[ ]
        # Check if value is a list and parse each element
        if isinstance(value, (list, tuple, types.GeneratorType)):
            _ret = [ipaddr(element, str(query), version) for element in value]
            return [item for item in _ret if item]
>       elif not value or value is True:
E       jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'adddress'
We don t have general-purpose rules when writing templates. Like for the source of truth, there is no need to create generic templates able to produce any BGP configuration. There is a balance to be found between readability and avoiding duplication. Templates can become scary and complex: sometimes, it s better to write a filter or a function in jerikan/jinja.py. Mastering Jinja2 is a good investment. Take time to browse through our templates as some of them show interesting features.

Checks Optionally, each configuration file can be validated by a script in the checks/ directory. Jerikan looks up the key checks in the build namespace to know which checks to run:
$ ./run-jerikan lookup edge1.ussfo03.blade-group.net build checks
- description: Juniper configuration file syntax check
  script: checks/junoser
  cache:
    input: config.txt
    output: config-set.txt
- description: check YAML data
  script: checks/data.yaml
  cache: data.yaml
In the above example, checks/junoser is executed if there is a change to the generated config.txt file. It also outputs a transformed version of the configuration file which is easier to understand when using diff. Junoser checks a Junos configuration file using Juniper s XML schema definition for Netconf.5 On error, Jerikan displays:
jerikan/build.py:127: RuntimeError
-------------- Captured syntax check with Junoser call --------------
P: checks/junoser edge2.ussfo03.blade-group.net
C: /app/jerikan
O:
E: Invalid syntax:  set system syslog archive size 10m files 10 word-readable
S: 1

Integration into GitLab CI The next step is to compile the templates using a CI. As we are using GitLab, Jerikan ships with a .gitlab-ci.yml file. When we need to make a change, we create a dedicated branch and a merge request. GitLab compiles the templates using the same environment we use on our laptops and store them as an artifact.
Merge requests in GitLab for a change
Merge request to add a new port in USSFO03. The templates were compiled successfully but approval from another team member is still required to merge.
Before approving the merge request, another team member looks at the changes in data and templates but also the differences for the generated configuration files:
Differences for generated configuration files
The change configures a port on the Juniper device, adds records to DNS, and updates NetBox with the new IP addresses (not shown).

Ansible After Jerikan has built the configuration files, Ansible takes over. It is also packaged as a Docker image to avoid the trouble to maintain the right Python virtual environment and ensure everyone is using the same versions.

Inventory Jerikan has generated an inventory file. It contains all the managed devices, the variables defined for each of them and the groups converted to Ansible groups:
ob1-n1.sk1.blade-group.net ansible_host=172.29.15.12 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
ob2-n1.sk1.blade-group.net ansible_host=172.29.15.13 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
ob1-n1.ussfo03.blade-group.net ansible_host=172.29.15.12 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
none ansible_connection=local
[oob]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[os-ios]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[model-c2960s]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[location-sk1]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
[location-ussfo03]
ob1-n1.ussfo03.blade-group.net
[in-sync]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
none
in-sync is a special group for devices which configuration should match the golden configuration. Daily and unattended, Ansible should be able to push configurations to this group. The mid-term goal is to cover all devices. none is a special device for tasks not related to a specific host. This includes synchronizing NetBox, IRR objects, and the DNS, updating the RPKI, and building the geofeed files.

Playbook We use a single playbook for all devices. It is described in the ansible/playbooks/site.yaml file. Here is a shortened version:
- hosts: adm-gateway:!done
  strategy: mitogen_linear
  roles:
    - blade.linux
    - blade.adm-gateway
    - done
- hosts: os-linux:!done
  strategy: mitogen_linear
  roles:
    - blade.linux
    - done
- hosts: os-junos:!done
  gather_facts: false
  roles:
    - blade.junos
    - done
- hosts: os-opengear:!done
  gather_facts: false
  roles:
    - blade.opengear
    - done
- hosts: none:!done
  gather_facts: false
  roles:
    - blade.none
    - done
A host executes only one of the play. For example, a Junos device executes the blade.junos role. Once a play has been executed, the device is added to the done group and the other plays are skipped. The playbook can be executed with the configuration files generated by the GitLab CI using the ./run-ansible-gitlab command. This is a wrapper around Docker and the ansible-playbook command and it accepts the same arguments. To deploy the configuration on the edge devices for the SK1 datacenter in check mode, we use:
$ ./run-ansible-gitlab playbooks/site.yaml --limit='edge:&location-sk1' --diff --check
[ ]
PLAY RECAP *************************************************************
edge1.sk1.blade-group.net  : ok=6    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
edge2.sk1.blade-group.net  : ok=5    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
We have some rules when writing roles:
  • --check must detect if a change is needed;
  • --diff must provide a visualization of the planned changes;
  • --check and --diff must not display anything if there is nothing to change;
  • writing a custom module tailored to our needs is a valid solution;
  • the whole device configuration is managed;6
  • secrets must be stored in Vault;
  • templates should be avoided as we have Jerikan for that; and
  • avoid duplication and reuse tasks.7
We avoid using collections from Ansible Galaxy, the exception being collections to connect and interact with vendor devices, like cisco.iosxr collection. The quality of Ansible Galaxy collections is quite random and it is an additional maintenance burden. It seems better to write roles tailored to our needs. The collections we use are in ci/ansible/ansible-galaxy.yaml. We use Mitogen to get a 10 speedup on Ansible executions on Linux hosts. We also have a few playbooks for operational purpose: upgrading the OS version, isolate an edge router, etc. We were also planning on how to add operational checks in roles: are all the BGP sessions up? They could have been used to validate a deployment and rollback if there is an issue. Currently, our playbooks are run from our laptops. To keep tabs, we are using ARA. A weekly dry-run on devices in the in-sync group also provides a dashboard on which devices we need to run Ansible on.

Configuration data and templates Jerikan ships with pre-populated data and templates matching the configuration of our USSFO03 and SK1 datacenters. They do not exist anymore but, we promise, all this was used in production back in the days!
Network architecture for Blade datacenter
The latest iteration of our network infrastructure for SK1, USSFO03, and future data centers. The production network is using BGPttH using a spine-leaf fabric. The out-of-band network is using a simple L2 design, using the spanning tree protocol, as well as a set of console servers.
Notably, you can find the configuration for:
our edge routers
Some are running on Junos, like edge2.ussfo03, the others on IOS-XR, like edge1.sk1. The implemented functionalities are similar in both cases and we could swap one for the other. It includes the BGP configuration for transits, peerings, and IX as well as the associated BGP policies. PeeringDB is queried to get the maximum number of prefixes to accept for peerings. bgpq3 and a containerized IRRd help to filter received routes. A firewall is added to protect the routing engine. Both IPv4 and IPv6 are configured.
our BGP-based fabric
BGP is used inside the datacenter8 and is extended on bare-metal hosts. The configuration is automatically derived from the device location and the port number.9 Top-of-the-rack devices are using passive BGP sessions for ports towards servers. They are also serving a provisioning network to let them boot using DHCP and PXE. They also act as a DHCP server. The design is multivendor. Some devices are running Cumulus Linux, like to1-p1.ussfo03, while some others are running Junos, like to1-p2.ussfo03.
our out-of-band fabric
We are using Cisco Catalyst 2960 switches to build an L2 out-of-band network. To provide redundancy and saving a few bucks on wiring, we build small loops and run the spanning-tree protocol. See ob1-p1.ussfo03. It is redundantly connected to our gateway servers. We also use OpenGear devices for console access. See con1-n1.ussfo03
our administrative gateways
These Linux servers have multiple purposes: SSH jump boxes, rescue connection, direct access to the out-of-band network, zero-touch provisioning of network devices,10 Internet access for management flows, centralization of the console servers using Conserver, and API for autoconfiguration of BGP sessions for bare-metal servers. They are the first servers installed in a new datacenter and are used to provision everything else. Check both the generated files and the associated Ansible tasks.

  1. Ansible does not even provide a line number when there is an error in a template. You may need to find the problem by bisecting.
    $ ansible --version
    ansible 2.10.8
    [ ]
    $ cat test.j2
    Hello   name  !
    $ ansible all -i localhost, \
    >  --connection=local \
    >  -m template \
    >  -a "src=test.j2 dest=test.txt"
    localhost   FAILED! =>  
        "changed": false,
        "msg": "AnsibleUndefinedVariable: 'name' is undefined"
     
    
  2. You may recognize the same concepts as in Hiera, the hierarchical key-value store from Puppet. At first, we were using Jerakia, a similar independent store exposing an HTTP REST interface. However, the lookup overhead is too large for our use. Jerikan implements the same functionality within a Python function.
  3. The list of available filters is mangled inside jerikan/jinja.py. This is a remain of the fact we do not maintain Jerikan as a standalone software.
  4. This is a bit confusing: we have a store() filter and a store() function. With Jinja2, filters and functions live in two different namespaces.
  5. We are using a fork with some modifications to be able to validate our configurations and exposing an HTTP service to reduce the time spent on each configuration check.
  6. There is a trend in network automation to automate a configuration subset, for example by having a playbook to create a new BGP session. We believe this is wrong: with time, your configuration will get out-of-sync with its expected state, notably hand-made changes will be left undetected.
  7. See ansible/roles/blade.linux/tasks/firewall.yaml and ansible/roles/blade.linux/tasks/interfaces.yaml. They are meant to be called when needed, using import_role.
  8. We also have some datacenters using BGP EVPN VXLAN at medium-scale using Juniper devices. As they are still in production today, we didn t include this feature but we may publish it in the future.
  9. In retrospect, this may not be a good idea unless you are pretty sure everything is uniform (number of switches for each layer, number of ports). This was not our case. We now think it is a better idea to assign a prefix to each device and write it in the source of truth.
  10. Non-linux based devices are upgraded and configured unattended. Cumulus Linux devices are automatically upgraded on install but the final configuration has to be pushed using Ansible: we didn t want to duplicate the configuration process using another tool.

24 May 2021

Vincent Bernat: Transient prompt with Zsh

Powerlevel10k is a theme for Zsh. It contains some powerful features, is astoundingly fast, and easy to customize. I am quite amazed at the skills of its main author. Be sure to also have a look at Zsh for Humans, a complete Zsh configuration including this theme. One of the nice features of Powerlevel10k is transient prompts: past prompts are reduced to a more minimal configuration to save space by removing unneeded information.
Demonstration of a transient prompt with Zsh: past prompts use a more compact form
My implementation of a transient prompt with Zsh. Past prompts are compact and include the time of the command execution, the hostname, and the status of the previous command while the complete prompt contains more information like the current directory and the Git branch.
When it comes to configuring my shell, I still prefer writing and understanding each line going into it. Therefore, I am still building my Zsh configuration from scratch. Here is how I have integrated the above transient feature into my prompt. The first step is to configure the appearance of the prompt in its compact form. Let s assume we have a variable, $_vbe_prompt_compact set to 1 when we want a compact prompt. We use the following function to define the prompt appearance:
_vbe_prompt ()  
    local retval=$?
    # When compact, just time + prompt sign
    if (( $_vbe_prompt_compact )); then
        # Current time (with timezone for remote hosts)
        _vbe_prompt_segment cyan default "%D %H:%M$ SSH_TTY+ %Z  "
        # Hostname for remote hosts
        [[ $SSH_TTY ]] && \
            _vbe_prompt_segment black magenta "%B%M%b"
        # Status of the last command
        if (( $retval )); then
            _vbe_prompt_segment red default $ PRCH[reta] 
        else
            _vbe_prompt_segment green cyan $ PRCH[ok] 
        fi
        # End of prompt
        _vbe_prompt_end
        return
    fi
    # Regular prompt with many information
    # [ ]
 
setopt prompt_subst
PS1='$(_vbe_prompt) '

Update (2021.05) The following part has been rewritten to be more robust. The code is stolen from Powerlevel10k s issue #888. See the comments for more details.

Our next step is to redraw the prompt after accepting a command. We wrap Zsh line editor into a function:1
_vbe-zle-line-init()  
    [[ $CONTEXT == start ]]   return 0
    # Start regular line editor
    (( $+zle_bracketed_paste )) && print -r -n - $zle_bracketed_paste[1]
    zle .recursive-edit
    local -i ret=$?
    (( $+zle_bracketed_paste )) && print -r -n - $zle_bracketed_paste[2]
    # If we received EOT, we exit the shell
    if [[ $ret == 0 && $KEYS == $'\4' ]]; then
        _vbe_prompt_compact=1
        zle .reset-prompt
        exit
    fi
    # Line edition is over. Shorten the current prompt.
    _vbe_prompt_compact=1
    zle .reset-prompt
    unset _vbe_prompt_compact
    if (( ret )); then
        # Ctrl-C
        zle .send-break
    else
        # Enter
        zle .accept-line
    fi
    return ret
 
zle -N zle-line-init _vbe-zle-line-init
That s all!
One downside of using the powerline fonts is that it messes with copy/paste. As I am using tmux, I use the following snippet to work around this issue and use only standard Unicode characters when copying from the terminal:
bind-key -T copy-mode M-w \
  send -X copy-pipe-and-cancel "sed 's/ .* /%/g'   xclip -i -selection clipboard" \;\
  display-message "Selection saved to clipboard!"
Copying and pasting the text from the screenshot above yields the following text:
14:21 % ssh eizo.luffy.cx
Linux eizo 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
Last login: Fri Apr 23 14:20:39 2021 from 2a01:cb00:3f:b02:9db6:efa4:d85:7f9f
14:21 CEST % uname -a
Linux eizo 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
14:21 CEST %
Connection to eizo.luffy.cx closed.
14:22 % git status
On branch article/zsh-transient
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        ../../media/images/zsh-compact-prompt@2x.jpg
nothing added to commit but untracked files present (use "git add" to track)

  1. We have to manually enable bracketed paste because Zsh does it after zle-line-init.

16 February 2021

Michael Prokop: How to properly use 3rd party Debian repository signing keys with apt

(Blogging this, since this is a recurring anti-pattern I noticed at several customers and often comes up during deployments of 3rd party repositories.) Update on 2021-02-19: clarified, that Signed-By requires apt >= 1.1, thanks Vincent Bernat Many upstream projects provide Debian repository instructions like this:
curl -fsSL https://example.com/stable/debian.gpg   sudo apt-key add -
Do not follow this, for different reasons, including:
  1. You do not see what you get before adding the GPG key to your global apt trust store
  2. You can t easily script this via your preferred configuration management (the apt-key manpage clearly discourages programmatic usage)
  3. The signing key is considered valid for all your enabled Debian repositories (instead of only a specific one)
  4. You need GnuPG (either gnupg2 or gnupg1) on your system for usage with apt-key
There s a much better approach to this: download the GPG key, make sure it s in the appropriate format, then use it via deb [signed-by=/usr/share/keyrings/ ] in your apt s sources list configuration. Note and FTR: the Signed-By feature is available starting with apt 1.1 (so apt in Debian jessie/8 and older does not support it). TL;DR: As an example, let s demonstrate this with the Tailscale Debian repository for buster.
Downloading the GPG file will give you an ascii-armored GPG file:
% curl -fsSL -o buster.gpg https://pkgs.tailscale.com/stable/debian/buster.gpg
% gpg --keyid-format long buster.gpg 
gpg: WARNING: no command supplied.  Trying to guess what you mean ...
pub   rsa4096/458CA832957F5868 2020-02-25 [SC]
      2596A99EAAB33821893C0A79458CA832957F5868
uid                           Tailscale Inc. (Package repository signing key) <info@tailscale.com>
sub   rsa4096/B1547A3DDAAF03C6 2020-02-25 [E]
% file buster.gpg
buster.gpg: PGP public key block Public-Key (old)
If you have apt version >= 1.4 available (Debian >=stretch/9 and Ubuntu >=bionic/18.04), you can use this file directly as follows:
% sudo mv buster.gpg /usr/share/keyrings/tailscale.asc
% cat /etc/apt/sources.list.d/tailscale.list
deb [signed-by=/usr/share/keyrings/tailscale.asc] https://pkgs.tailscale.com/stable/debian buster main
% sudo apt update
[...]
And you re done! Iff your apt version really is older than 1.4, you need to convert the ascii-armored GPG file into a GPG key public ring file (AKA binary OpenPGP format), either by just dearmor-ing it (if you don t care about checking ID + fingerprint):
% gpg --dearmor < buster.gpg > tailscale.gpg
or if you prefer to go via GPG, you can also use a temporary GPG home directory (if you don t care about going through your personal GPG setup):
% mkdir --mode=700 /tmp/gpg-tmpdir
% gpg --homedir /tmp/gpg-tmpdir --import ./buster.gpg
gpg: keybox '/tmp/gpg-tmpdir/pubring.kbx' created
gpg: /tmp/gpg-tmpdir/trustdb.gpg: trustdb created
gpg: key 458CA832957F5868: public key "Tailscale Inc. (Package repository signing key) <info@tailscale.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
% gpg --homedir /tmp/gpg-tmpdir --output tailscale.gpg  --export-options=export-minimal --export 0x458CA832957F5868
% rm -rf /tmp/gpg-tmpdir
The resulting GPG key public ring file should look like that:
% file tailscale.gpg 
tailscale.gpg: PGP/GPG key public ring (v4) created Tue Feb 25 04:51:20 2020 RSA (Encrypt or Sign) 4096 bits MPI=0xc00399b10bc12858...
% gpg tailscale.gpg 
gpg: WARNING: no command supplied.  Trying to guess what you mean ...
pub   rsa4096/458CA832957F5868 2020-02-25 [SC]
      2596A99EAAB33821893C0A79458CA832957F5868
uid                           Tailscale Inc. (Package repository signing key) <info@tailscale.com>
sub   rsa4096/B1547A3DDAAF03C6 2020-02-25 [E]
Then you can use this GPG file on your system as follows:
% sudo mv tailscale.gpg /usr/share/keyrings/tailscale.gpg
% cat /etc/apt/sources.list.d/tailscale.list
deb [signed-by=/usr/share/keyrings/tailscale.gpg] https://pkgs.tailscale.com/stable/debian buster main
% sudo apt update
[...]
Such a setup ensures:
  1. You can verify the GPG key file (ID + fingerprint)
  2. You can easily ship files via /usr/share/keyrings/ and refer to it in your deployment scripts, configuration management, (and can also easily update or get rid of them again!)
  3. The GPG key is valid only for the repositories with the corresponding [signed-by=/usr/share/keyrings/ ] entry
  4. You don t need to install GnuPG (neither gnupg2 nor gnupg1) on the system which is using the 3rd party Debian repository
Thanks: Guillem Jover for reviewing an early draft of this blog article.

17 January 2021

Wouter Verhelst: Software available through Extrepo

Just over 7 months ago, I blogged about extrepo, my answer to the "how do you safely install software on Debian without downloading random scripts off the Internet and running them as root" question. I also held a talk during the recent "MiniDebConf Online" that was held, well, online. The most important part of extrepo is "what can you install through it". If the number of available repositories is too low, there's really no reason to use it. So, I thought, let's look what we have after 7 months... To cut to the chase, there's a bunch of interesting content there, although not all of it has a "main" policy. Each of these can be enabled by installing extrepo, and then running extrepo enable <reponame>, where <reponame> is the name of the repository. Note that the list is not exhaustive, but I intend to show that even though we're nowhere near complete, extrepo is already quite useful in its current state:

Free software
  • The debian_official, debian_backports, and debian_experimental repositories contain Debian's official, backports, and experimental repositories, respectively. These shouldn't have to be managed through extrepo, but then again it might be useful for someone, so I decided to just add them anyway. The config here uses the deb.debian.org alias for CDN-backed package mirrors.
  • The belgium_eid repository contains the Belgian eID software. Obviously this is added, since I'm upstream for eID, and as such it was a large motivating factor for me to actually write extrepo in the first place.
  • elastic: the elasticsearch software.
  • Some repositories, such as dovecot, winehq and bareos contain upstream versions of their respective software. These two repositories contain software that is available in Debian, too; but their upstreams package their most recent release independently, and some people might prefer to run those instead.
  • The sury, fai, and postgresql repositories, as well as a number of repositories such as openstack_rocky, openstack_train, haproxy-1.5 and haproxy-2.0 (there are more) contain more recent versions of software packaged in Debian already by the same maintainer of that package repository. For the sury repository, that is PHP; for the others, the name should give it away. The difference between these repositories and the ones above is that it is the official Debian maintainer for the same software who maintains the repository, which is not the case for the others.
  • The vscodium repository contains the unencumbered version of Microsoft's Visual Studio Code; i.e., the codium version of Visual Studio Code is to code as the chromium browser is to chrome: it is a build of the same softare, but without the non-free bits that make code not entirely Free Software.
  • While Debian ships with at least two browsers (Firefox and Chromium), additional browsers are available through extrepo, too. The iridiumbrowser repository contains a Chromium-based browser that focuses on privacy.
  • Speaking of privacy, perhaps you might want to try out the torproject repository.
  • For those who want to do Cloud Computing on Debian in ways that isn't covered by Openstack, there is a kubernetes repository that contains the Kubernetes stack, the as well as the google_cloud one containing the Google Cloud SDK.

Non-free software While these are available to be installed through extrepo, please note that non-free and contrib repositories are disabled by default. In order to enable these repositories, you must first enable them; this can be accomplished through /etc/extrepo/config.yaml.
  • In case you don't care about freedom and want the official build of Visual Studio Code, the vscode repository contains it.
  • While we're on the subject of Microsoft, there's also Microsoft Teams available in the msteams repository. And, hey, skype.
  • For those who are not satisfied with the free browsers in Debian or any of the free repositories, there's opera and google_chrome.
  • The docker-ce repository contains the official build of Docker CE. While this is the free "community edition" that should have free licenses, I could not find a licensing statement anywhere, and therefore I'm not 100% sure whether this repository is actually free software. For that reason, it is currently marked as a non-free one. Merge Requests for rectifying that from someone with more information on the actual licensing situation of Docker CE would be welcome...
  • For gamers, there's Valve's steam repository.
Again, the above lists are not meant to be exhaustive. Special thanks go out to Russ Allbery, Kim Alvefur, Vincent Bernat, Nick Black, Arnaud Ferraris, Thorsten Glaser, Thomas Goirand, Juri Grabowski, Paolo Greppi, and Josh Triplett, for helping me build the current list of repositories. Is your favourite repository not listed? Create a configuration based on template.yaml, and file a merge request!

15 November 2020

Vincent Bernat: Zero-Touch Provisioning for Juniper

Juniper s official documentation on ZTP explains how to configure the ISC DHCP Server to automatically upgrade and configure on first boot a Juniper device. However, the proposed configuration could be a bit more elegant. This note explains how.

TL;DR Do not redefine option 43. Instead, specify the vendor option space to use to encode parameters with vendor-option-space.


When booting for the first time, a Juniper device requests its IP address through a DHCP discover message, then request additional parameters for autoconfiguration through a DHCP request message:
Dynamic Host Configuration Protocol (Request)
    Message type: Boot Request (1)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x44e3a7c9
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 0.0.0.0
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: 02:00:00:00:00:01 (02:00:00:00:00:01)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (54) DHCP Server Identifier (10.0.2.2)
    Option: (55) Parameter Request List
        Length: 14
        Parameter Request List Item: (3) Router
        Parameter Request List Item: (51) IP Address Lease Time
        Parameter Request List Item: (1) Subnet Mask
        Parameter Request List Item: (15) Domain Name
        Parameter Request List Item: (6) Domain Name Server
        Parameter Request List Item: (66) TFTP Server Name
        Parameter Request List Item: (67) Bootfile name
        Parameter Request List Item: (120) SIP Servers
        Parameter Request List Item: (44) NetBIOS over TCP/IP Name Server
        Parameter Request List Item: (43) Vendor-Specific Information
        Parameter Request List Item: (150) TFTP Server Address
        Parameter Request List Item: (12) Host Name
        Parameter Request List Item: (7) Log Server
        Parameter Request List Item: (42) Network Time Protocol Servers
    Option: (50) Requested IP Address (10.0.2.15)
    Option: (53) DHCP Message Type (Request)
    Option: (60) Vendor class identifier
        Length: 15
        Vendor class identifier: Juniper-mx10003
    Option: (51) IP Address Lease Time
    Option: (12) Host Name
    Option: (255) End
    Padding: 00
It requests several options, including the TFTP server address option 150, and the Vendor-Specific Information Option 43 or VSIO. The DHCP server can use option 60 to identify the vendor-specific information to send. For Juniper devices, option 43 encodes the image name and the configuration file name. They are fetched from the IP address provided in option 150. The official documentation on ZTP provides a valid configuration to answer such a request. However, it does not leverage the ability of the ISC DHCP Server to support several vendors and redefines option 43 to be Juniper-specific:
option NEW_OP-encapsulation code 43 = encapsulate NEW_OP;
Instead, it is possible to define an option space for Juniper, using a self-descriptive name, without overriding option 43:
# Juniper vendor option space
option space juniper;
option juniper.image-file-name     code 0 = text;
option juniper.config-file-name    code 1 = text;
option juniper.image-file-type     code 2 = text;
option juniper.transfer-mode       code 3 = text;
option juniper.alt-image-file-name code 4 = text;
option juniper.http-port           code 5 = text;
Then, when you need to set these suboptions, specify the vendor option space:
class "juniper-mx10003"  
  match if (option vendor-class-identifier = "Juniper-mx10003")  
  vendor-option-space juniper;
  option juniper.transfer-mode    "http";
  option juniper.image-file-name  "/images/junos-vmhost-install-mx-x86-64-19.3R2-S4.5.tgz";
  option juniper.config-file-name "/cfg/juniper-mx10003.txt";
 
This configuration returns the following answer:1
Dynamic Host Configuration Protocol (ACK)
    Message type: Boot Reply (2)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x44e3a7c9
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 10.0.2.15
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: 02:00:00:00:00:01 (02:00:00:00:00:01)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (53) DHCP Message Type (ACK)
    Option: (54) DHCP Server Identifier (10.0.2.2)
    Option: (51) IP Address Lease Time
    Option: (1) Subnet Mask (255.255.255.0)
    Option: (3) Router
    Option: (6) Domain Name Server
    Option: (43) Vendor-Specific Information
        Length: 89
        Value: 00362f696d616765732f6a756e6f732d766d686f73742d69 
    Option: (150) TFTP Server Address
    Option: (255) End
Using vendor-option-space directive allows you to make different ZTP implementations coexist. For example, you can add the option space for PXE:
option space PXE;
option PXE.mtftp-ip    code 1 = ip-address;
option PXE.mtftp-cport code 2 = unsigned integer 16;
option PXE.mtftp-sport code 3 = unsigned integer 16;
option PXE.mtftp-tmout code 4 = unsigned integer 8;
option PXE.mtftp-delay code 5 = unsigned integer 8;
option PXE.discovery-control    code 6  = unsigned integer 8;
option PXE.discovery-mcast-addr code 7  = ip-address;
option PXE.boot-server code 8  =   unsigned integer 16, unsigned integer 8, ip-address  ;
option PXE.boot-menu   code 9  =   unsigned integer 16, unsigned integer 8, text  ;
option PXE.menu-prompt code 10 =   unsigned integer 8, text  ;
option PXE.boot-item   code 71 = unsigned integer 32;
class "pxeclients"  
  match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
  vendor-option-space PXE;
  option PXE.mtftp-ip 10.0.2.2;
  # [ ]
 
On the same topic, do not override option 125 VIVSO. See Zero-Touch Provisioning for Cisco IOS.

  1. Wireshark knows how to decode option 43 for some vendors, thanks to option 60, but not for Juniper.

2 November 2020

Vincent Bernat: My collection of vintage PC cards

Recently, I have been gathering some old hardware at my parents house, notably PC extension cards, as they don t take much room and can be converted to a nice display item. Unfortunately, I was not very concerned about keeping stuff around. Compared to all the hardware I have acquired over the years, only a few pieces remain.

Tseng Labs ET4000AX (1989) This SVGA graphics card was installed into a PC powered by a 386SX CPU running at 16 MHz. This was a good card at the time as it was pretty fast. It didn t feature 2D acceleration, unlike the later ET4000/W32. This version only features 512 KB of RAM. It can display 1024 768 images with 16 colors or 800 600 with 256 colors. It was also compatible with CGA, EGA, VGA, MDA, and Hercules modes. No contemporary games were using the SVGA modes but the higher resolutions were useful with Windows 3. This card was manufactured directly by Tseng Labs.
Carte Tseng Labs ET4000AX ISA au-dessus de la bo te "Plan te Aventure"
Tseng Labs ET4000 AX ISA card

AdLib clone (1992) My first sound card was an AdLib. My parents bought it in Canada during the summer holidays in 1992. It uses a Yamaha OPL2 chip to produce sound via FM synthesis. The first game I have tried is Indiana Jones and the Last Crusade. I think I gave this AdLib to a friend once I upgraded my PC with a Sound Blaster Pro 2. Recently, I needed one for a side project, but they are rare and expensive on eBay. Someone mentioned a cheap clone on Vogons, so I bought it. It was sold by Sun Moon Star in 1992 and shipped with a CD-ROM of Doom shareware.
AdLib clone on top of "Alone in the Dark" box
AdLib clone ISA card by Sun Moon Star
On this topic, take a look at OPL2LPT: an AdLib sound card for the parallel port and OPL2 Audio Board: an AdLib sound card for Arduino .

Sound Blaster Pro 2 (1992) Later, I switched the AdLib sound card with a Sound Blaster Pro 2. It features an OPL3 chip and was also able to output digital samples. At the time, this was a welcome addition, but not as important as the FM synthesis introduced earlier by the AdLib.
Sound Blaster Pro 2 on top of "Day of the Tentacle" box
Sound Blaster Pro 2 ISA card

Promise EIDE 2300 Plus (1995) I bought this card mostly for the serial port. I was using a 486DX2 running at 66 MHz with a Creatix LC 288 FC external modem. The serial port was driven by an 8250 UART with no buffer. Thanks to Terminate, I was able to connect to BBSes with DOS, but this was not possible with Windows 3 or OS/2. I needed one of these fancy new cards with a 16550 UART, featuring a 16-byte buffer. At the time, this was quite difficult to find in France. During a holiday trip, I convinced my parent to make a short detour from Los Angeles to San Diego to buy this Promise EIDE 2300 Plus controller card at a shop I located through an advertisement in a local magazine! The card also features an EIDE controller with multi-word DMA mode 2 support. In contrast with the older PIO modes, the CPU didn t have to copy data from disk to memory.
Promise EIDE 2300 Plus next to an OS/2 Warp CD
Promise EIDE 2300 Plus VLB card

3dfx Voodoo2 Magic 3D II (1998) The 3dfx Voodoo2 was one of the first add-in graphics cards implementing hardware acceleration of 3D graphics. I bought it from a friend along with his Pentium II box in 1999. It was a big evolutionary step in PC gaming, as games became more beautiful and fluid. A traditional video controller was still required for 2D. A pass-through VGA cable daisy-chained the video controller to the Voodoo, which was itself connected to the monitor.
3dfx Voodoo 2 Magic 3D II on top of "Jedi Knight: Dark Forces II" box
3dfx Voodoo2 Magic 3D II PCI card

3Com 3C905C-TX-M Tornado (1999) In the early 2000s, in college, the Internet connection on the campus was provided by a student association through a 100 Mbps Ethernet cable. If you wanted to reach the maximum speed, the 3Com 3C905C-TX-M PCI network adapter, nicknamed Tornado , was the card you needed. We would buy it second-hand by the dozen and sell them to other students for around 30 .
3COM 3C905C-TX-M on top of "Red Alert" box
3Com 3C905C-TX-M PCI card

1 November 2020

Vincent Bernat: Running Isso on NixOS in a Docker container

This short article documents how I run Isso, the commenting system used by this blog, inside a Docker container on NixOS, a Linux distribution built on top of Nix. Nix is a declarative package manager for Linux and other Unix systems.
While NixOS 20.09 includes a derivation for Isso, it is unfortunately broken and relies on Python 2. As I am also using a fork of Isso, I have built my own derivation, heavily inspired by the one in master:1
issoPackage = with pkgs.python3Packages; buildPythonPackage rec  
  pname = "isso";
  version = "custom";
  src = pkgs.fetchFromGitHub  
    # Use my fork
    owner = "vincentbernat";
    repo = pname;
    rev = "vbe/master";
    sha256 = "0vkkvjcvcjcdzdj73qig32hqgjly8n3ln2djzmhshc04i6g9z07j";
   ;
  propagatedBuildInputs = [
    itsdangerous
    jinja2
    misaka
    html5lib
    werkzeug
    bleach
    flask-caching
  ];
  buildInputs = [
    cffi
  ];
  checkInputs = [ nose ];
  checkPhase = ''
    $ python.interpreter  setup.py nosetests
  '';
 ;
I want to run Isso through Gunicorn. To this effect, I build a Python environment combining Isso and Gunicorn. Then, I can invoke the latter with "$ issoEnv /bin/gunicorn", like with a virtual environment.
issoEnv = pkgs.python3.buildEnv.override  
    extraLibs = [
      issoPackage
      pkgs.python3Packages.gunicorn
      pkgs.python3Packages.gevent
    ];
 ;
Before building a Docker image, I also need to specify the Isso configuration file for Isso:
issoConfig = pkgs.writeText "isso.conf" ''
  [general]
  dbpath = /db/comments.db
  host =
    https://vincent.bernat.ch
    http://localhost:8080
  notify = smtp
  [ ]
'';
NixOS comes with a convenient tool to build a Docker image without a Dockerfile:
issoDockerImage = pkgs.dockerTools.buildImage  
  name = "isso";
  tag = "latest";
  extraCommands = ''
    mkdir -p db
  '';
  config =  
    Cmd = [ "$ issoEnv /bin/gunicorn"
            "--name" "isso"
            "--bind" "0.0.0.0:$ port "
            "--worker-class" "gevent"
            "--workers" "2"
            "--worker-tmp-dir" "/dev/shm"
            "--preload"
            "isso.run"
          ];
    Env = [
      "ISSO_SETTINGS=$ issoConfig "
      "SSL_CERT_FILE=$ pkgs.cacert /etc/ssl/certs/ca-bundle.crt"
    ];
   ;
 ;
Because we refer to the issoEnv derivation in config.Cmd, the whole derivation, including Isso and Gunicorn, is copied inside the Docker image. The same applies for issoConfig, the configuration file we created earlier, and pkgs.cacert, the derivation containing trusted root certificates. The resulting image is 171 MB once installed, which is comparable to the Debian Buster image generated by the official Dockerfile. NixOS features an abstraction to run Docker containers. It is not currently documented in NixOS manual but you can look at the source code of the module for the available options. I choose to use Podman instead of Docker as the backend because it does not require running an additional daemon.
virtualisation.oci-containers =  
  backend = "podman";
  containers =  
    isso =  
      image = "isso";
      imageFile = issoDockerImage;
      ports = ["127.0.0.1:$ port :$ port "];
      volumes = [
        "/var/db/isso:/db"
      ];
     ;
   ;
 ;
A systemd unit file is automatically created to run and supervise the container:
$ systemctl status podman-isso.service
  podman-isso.service
     Loaded: loaded (/nix/store/a66gzqqwcdzbh99sz8zz5l5xl8r8ag7w-unit->
     Active: active (running) since Sun 2020-11-01 16:04:16 UTC; 4min 44s ago
    Process: 14564 ExecStartPre=/nix/store/95zfn4vg4867gzxz1gw7nxayqcl>
   Main PID: 14697 (podman)
         IP: 0B in, 0B out
      Tasks: 10 (limit: 2313)
     Memory: 221.3M
        CPU: 10.058s
     CGroup: /system.slice/podman-isso.service
              14697 /nix/store/pn52xgn1wb2vr2kirq3xj8ij0rys35mf-podma>
              14802 /nix/store/7vsba54k6ag4cfsfp95rvjzqf6rab865-conmo>
nov. 01 16:04:17 web03 podman[14697]: container init (image=localhost/isso:latest)
nov. 01 16:04:17 web03 podman[14697]: container start (image=localhost/isso:latest)
nov. 01 16:04:17 web03 podman[14697]: container attach (image=localhost/isso:latest)
nov. 01 16:04:19 web03 conmon[14802]: INFO: connected to SMTP server
nov. 01 16:04:19 web03 conmon[14802]: INFO: connected to https://vincent.bernat.ch
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Starting gunicorn 20.0.4
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Listening at: http://0.0.0.0:8080 (1)
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Using worker: gevent
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Booting worker with pid: 8
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Booting worker with pid: 9
As the last step, we configure Nginx to forward requests for comments.luffy.cx to the container. NixOS provides a simple integration to grab a Let s Encrypt certificate.
services.nginx.virtualHosts."comments.luffy.cx" =  
  root = "/data/webserver/comments.luffy.cx";
  enableACME = true;
  forceSSL = true;
  extraConfig = ''
    access_log /var/log/nginx/comments.luffy.cx.log anonymous;
  '';
  locations."/" =  
    proxyPass = "http://127.0.0.1:$ port ";
    extraConfig = ''
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-Proto $scheme;
      proxy_hide_header Set-Cookie;
      proxy_hide_header X-Set-Cookie;
      proxy_ignore_headers Set-Cookie;
    '';
   ;
 ;
security.acme.certs."comments.luffy.cx" =  
  email = lib.concatStringsSep "@" [ "letsencrypt" "vincent.bernat.ch" ];
 ;

While I still struggle with Nix and NixOS, I am convinced this is how declarative infrastructure should be done. I like how in one single file, I can define the derivation to build Isso, the configuration, the Docker image, the container definition, and the Nginx configuration. The Nix language is used both for building packages and for managing configurations. Moreover, the Docker image is updated automatically like a regular NixOS host. This solves an issue plaguing the Docker ecosystem: no more stale images! My next step would be to combine this approach with Nomad, a simple orchestrator to deploy and manage containers.

  1. There is a subtle difference: I am using buildPythonPackage instead of buildPythonApplication. This is important for the next step. I didn t investigate if an application can be converted to a package easily.

29 September 2020

Vincent Bernat: Speeding up bgpq4 with IRRd in a container

When building route filters with bgpq4 or bgpq3, the speed of rr.ntt.net or whois.radb.net can be a bottleneck. Updating many filters may take several tens of minutes, depending on the load:
$ time bgpq4 -h whois.radb.net AS-HURRICANE   wc -l
909869
1.96s user 0.15s system 2% cpu 1:17.64 total
$ time bgpq4 -h rr.ntt.net AS-HURRICANE   wc -l
927865
1.86s user 0.08s system 12% cpu 14.098 total
A possible solution is to run your own IRRd instance in your network, mirroring the main routing registries. A close alternative is to bundle IRRd with all the data in a ready-to-use Docker image. This also has the advantage of easy integration into a Docker-based CI/CD pipeline.
$ git clone https://github.com/vincentbernat/irrd-legacy.git -b blade/master
$ cd irrd-legacy
$ docker build . -t irrd-snapshot:latest
[ ]
Successfully built 58c3e83a1d18
Successfully tagged irrd-snapshot:latest
$ docker container run --rm --detach --publish=43:43 irrd-snapshot
4879cfe7413075a0c217089dcac91ed356424c6b88808d8fcb01dc00eafcc8c7
$ time bgpq4 -h localhost AS-HURRICANE   wc -l
904137
1.72s user 0.11s system 96% cpu 1.881 total
The Dockerfile contains three stages:
  1. building IRRd,1
  2. retrieving various IRR databases, and
  3. assembling the final container with the result of the two previous stages.
The second stage fetches the databases used by rr.ntt.net: NTTCOM, RADB, RIPE, ALTDB, BELL, LEVEL3, RGNET, APNIC, JPIRR, ARIN, BBOI, TC, AFRINIC, ARIN-WHOIS, and REGISTROBR. However, it misses RPKI.2 Feel free to adapt! The image can be scheduled to be rebuilt daily or weekly, depending on your needs. The repository includes a .gitlab-ci.yaml file automating the build and triggering the compilation of all filters by your CI/CD upon success.

  1. Instead of using the latest version of IRRd, the image relies on an older version that does not require a PostgreSQL instance and uses flat files instead.
  2. Unlike the others, the RPKI database is built from the published RPKI ROAs. They can be retrieved with rpki-client and transformed into RPSL objects to be imported in IRRd.

28 September 2020

Vincent Bernat: Syncing RIPE, ARIN and APNIC objects with a custom Ansible module

Internet is split into five regional Internet registry: AFRINIC, ARIN, APNIC, LACNIC and RIPE. Each RIR maintains an Internet Routing Registry. An IRR allows one to publish information about the routing of Internet number resources.1 Operators use this to determine the owner of an IP address and to construct and maintain routing filters. To ensure your routes are widely accepted, it is important to keep the prefixes you announce up-to-date in an IRR. There are two common tools to query this database: whois and bgpq4. The first one allows you to do a query with the WHOIS protocol:
$ whois -BrG 2a0a:e805:400::/40
[ ]
inet6num:       2a0a:e805:400::/40
netname:        FR-BLADE-CUSTOMERS-DE
country:        DE
geoloc:         50.1109 8.6821
admin-c:        BN2763-RIPE
tech-c:         BN2763-RIPE
status:         ASSIGNED
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2020-05-19T08:04:58Z
last-modified:  2020-05-19T08:04:58Z
source:         RIPE
route6:         2a0a:e805:400::/40
descr:          Blade IPv6 - AMS1
origin:         AS64476
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2019-10-01T08:19:34Z
last-modified:  2020-05-19T08:05:00Z
source:         RIPE
The second one allows you to build route filters using the information contained in the IRR database:
$ bgpq4 -6 -S RIPE -b AS64476
NN = [
    2a0a:e805::/40,
    2a0a:e805:100::/40,
    2a0a:e805:300::/40,
    2a0a:e805:400::/40,
    2a0a:e805:500::/40
];
There is no module available on Ansible Galaxy to manage these objects. Each IRR has different ways of being updated. Some RIRs propose an API but some don t. If we restrict ourselves to RIPE, ARIN and APNIC, the only common method to update objects is email updates, authenticated with a password or a GPG signature.2 Let s write a custom Ansible module for this purpose!

Notice I recommend that you read Writing a custom Ansible module as an introduction, as well as Syncing MySQL tables for a more instructive example.

Code The module takes a list of RPSL objects to synchronize and returns the body of an email update if a change is needed:
- name: prepare RIPE objects
  irr_sync:
    irr: RIPE
    mntner: fr-blade-1-mnt
    source: whois-ripe.txt
  register: irr

Prerequisites The source file should be a set of objects to sync using the RPSL language. This would be the same content you would send manually by email. All objects should be managed by the same maintainer, which is also provided as a parameter. Signing and sending the result is not the responsibility of this module. You need two additional tasks for this purpose:
- name: sign RIPE objects
  shell:
    cmd: gpg --batch --user noc@example.com --clearsign
    stdin: "  irr.objects  "
  register: signed
  check_mode: false
  changed_when: false
- name: update RIPE objects by email
  mail:
    subject: "NEW: update for RIPE"
    from: noc@example.com
    to: "auto-dbm@ripe.net"
    cc: noc@example.com
    host: smtp.example.com
    port: 25
    charset: us-ascii
    body: "  signed.stdout  "
You also need to authorize the PGP keys used to sign the updates by creating a key-cert object and adding it as a valid authentication method for the corresponding mntner object:
key-cert:  PGPKEY-A791AAAB
certif:    -----BEGIN PGP PUBLIC KEY BLOCK-----
certif:    
certif:    mQGNBF8TLY8BDADEwP3a6/vRhEERBIaPUAFnr23zKCNt5YhWRZyt50mKq1RmQBBY
[ ]
certif:    -----END PGP PUBLIC KEY BLOCK-----
mnt-by:    fr-blade-1-mnt
source:    RIPE
mntner:    fr-blade-1-mnt
[ ]
auth:      PGPKEY-A791AAAB
mnt-by:    fr-blade-1-mnt
source:    RIPE

Module definition Starting from the skeleton described in the previous article, we define the module:
module_args = dict(
    irr=dict(type='str', required=True),
    mntner=dict(type='str', required=True),
    source=dict(type='path', required=True),
)
result = dict(
    changed=False,
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)

Getting existing objects To grab existing objects, we use the whois command to retrieve all the objects from the provided maintainer.
# Per-IRR variations:
# - whois server
whois =  
    'ARIN': 'rr.arin.net',
    'RIPE': 'whois.ripe.net',
    'APNIC': 'whois.apnic.net'
 
# - whois options
options =  
    'ARIN': ['-r'],
    'RIPE': ['-BrG'],
    'APNIC': ['-BrG']
 
# - objects excluded from synchronization
excluded = ["domain"]
if irr == "ARIN":
    # ARIN does not return these objects
    excluded.extend([
        "key-cert",
        "mntner",
    ])
# Grab existing objects
args = ["-h", whois[irr],
        "-s", irr,
        *options[irr],
        "-i", "mnt-by",
        module.params['mntner']]
proc = subprocess.run("whois", *args, capture_output=True)
if proc.returncode != 0:
    raise AnsibleError(
        f"unable to query whois:  args ")
output = proc.stdout.decode('ascii')
got = extract(output, excluded)
The first part of the code setup some IRR-specific constants: the server to query, the options to provide to the whois command and the objects to exclude from synchronization. The second part invokes the whois command, requesting all objects whose mnt-by field is the provided maintainer. Here is an example of output:
$ whois -h whois.ripe.net -s RIPE -BrG -i mnt-by fr-blade-1-mnt
[ ]
inet6num:       2a0a:e805:300::/40
netname:        FR-BLADE-CUSTOMERS-FR
country:        FR
geoloc:         48.8566 2.3522
admin-c:        BN2763-RIPE
tech-c:         BN2763-RIPE
status:         ASSIGNED
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2020-05-19T08:04:59Z
last-modified:  2020-05-19T08:04:59Z
source:         RIPE
[ ]
route6:         2a0a:e805:300::/40
descr:          Blade IPv6 - PA1
origin:         AS64476
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2019-10-01T08:19:34Z
last-modified:  2020-05-19T08:05:00Z
source:         RIPE
[ ]
The result is passed to the extract() function. It parses and normalizes the results into a dictionary mapping object names to objects. We store the result in the got variable.
def extract(raw, excluded):
    """Extract objects."""
    # First step, remove comments and unwanted lines
    objects = "\n".join([obj
                         for obj in raw.split("\n")
                         if not obj.startswith((
                                 "#",
                                 "%",
                         ))])
    # Second step, split objects
    objects = [RPSLObject(obj.strip())
               for obj in re.split(r"\n\n+", objects)
               if obj.strip()
               and not obj.startswith(
                   tuple(f" x :" for x in excluded))]
    # Last step, put objects in a dict
    objects =  repr(obj): obj
               for obj in objects 
    return objects
RPSLObject() is a class enabling normalization and comparison of objects. Look at the module code for more details.
>>> output="""
... inet6num:       2a0a:e805:300::/40
... [ ]
... """
>>> pprint( k: str(v) for k,v in extract(output, excluded=[]) )
 '<Object:inet6num:2a0a:e805:300::/40>':
   'inet6num:       2a0a:e805:300::/40\n'
   'netname:        FR-BLADE-CUSTOMERS-FR\n'
   'country:        FR\n'
   'geoloc:         48.8566 2.3522\n'
   'admin-c:        BN2763-RIPE\n'
   'tech-c:         BN2763-RIPE\n'
   'status:         ASSIGNED\n'
   'mnt-by:         fr-blade-1-mnt\n'
   'remarks:        synced with cmdb\n'
   'source:         RIPE',
 '<Object:route6:2a0a:e805:300::/40>':
   'route6:         2a0a:e805:300::/40\n'
   'descr:          Blade IPv6 - PA1\n'
   'origin:         AS64476\n'
   'mnt-by:         fr-blade-1-mnt\n'
   'remarks:        synced with cmdb\n'
   'source:         RIPE' 

Comparing with wanted objects Let s build the wanted dictionary using the same structure, thanks to the extract() function we can use verbatim:
with open(module.params['source']) as f:
    source = f.read()
wanted = extract(source, excluded)
The next step is to compare got and wanted to build the diff object:
if got != wanted:
    result['changed'] = True
    if module._diff:
        result['diff'] = [
            dict(before_header=k,
                 after_header=k,
                 before=str(got.get(k, "")),
                 after=str(wanted.get(k, "")))
            for k in set((*wanted.keys(), *got.keys()))
            if k not in wanted or k not in got or wanted[k] != got[k]]

Returning updates The module does not have a side effect. If there is a difference, we return the updates to send by email. We choose to include all wanted objects in the updates (contained in the source variable) and let the IRR ignore unmodified objects. We also append the objects to be deleted by adding a delete: attribute to each them them.
# We send all source objects and deleted objects.
deleted_mark = f" 'delete:':16 deleted by CMDB"
deleted = "\n\n".join([f" got[k].raw \n deleted_mark "
                       for k in got
                       if k not in wanted])
result['objects'] = f" source \n\n deleted "
module.exit_json(**result)

The complete code is available on GitHub. The module supports both --diff and --check flags. It does not return anything if no change is detected. It can work with APNIC, RIPE and ARIN. It is not perfect: it may not detect some changes,3 it is not able to modify objects not owned by the provided maintainer4 and some attributes cannot be modified, requiring to manually delete and recreate the updated object.5 However, this module should automate 95% of your needs.

  1. Other IRRs exist without being attached to a RIR. The most notable one is RADb.
  2. ARIN is phasing out this method in favor of IRR-online. RIPE has an API available, but email updates are still supported and not planned to be deprecated. APNIC plans to expose an API.
  3. For ARIN, we cannot query key-cert and mntner objects and therefore we cannot detect changes in them. It is also not possible to detect changes to the auth mechanisms of a mntner object.
  4. APNIC do not assign top-level objects to the maintainer associated with the owner.
  5. Changing the status of an inetnum object requires deleting and recreating the object.

19 September 2020

Vincent Bernat: Keepalived and unicast over multiple interfaces

Keepalived is a Linux implementation of VRRP. The usual role of VRRP is to share a virtual IP across a set of routers. For each VRRP instance, a leader is elected and gets to serve the IP address, ensuring the high availability of the attached service. Keepalived can also be used for a generic leader election, thanks to its ability to use scripts for healthchecking and run commands on state change. A simple configuration looks like this:
vrrp_instance gateway1  
  state BACKUP          #  
  interface eth0        #  
  virtual_router_id 12  #  
  priority 101          #  
  virtual_ipaddress  
    2001:db8:ff/64
   
 
The state keyword in instructs Keepalived to not take the leader role when starting. Otherwise, incoming nodes create a temporary disruption by taking over the IP address until the election settles. The interface keyword in defines the interface for sending and receiving VRRP packets. It is also the default interface to configure the virtual IP address. The virtual_router_id directive in is common to all nodes sharing the virtual IP. The priority keyword in helps choosing which router will be elected as leader. If you need more information around Keepalived, be sure to check the documentation. VRRP design is tied to Ethernet networks and requires a multicast-enabled network for communication between nodes. In some environments, notably public clouds, multicast is unavailable. In this case, Keepalived can send VRRP packets using unicast:
vrrp_instance gateway1  
  state BACKUP
  interface eth0
  virtual_router_id 12
  priority 101
  unicast_peer  
    2001:db8::11
    2001:db8::12
   
  virtual_ipaddress  
    2001:db8:ff/64 dev lo
   
 
Another process, like a BGP daemon, should advertise the virtual IP address to the network . If needed, Keepalived can trigger whatever action is needed for this by using notify_* scripts. Until version 2.21 (not released yet), the interface directive is mandatory and Keepalived will transmit and receive VRRP packets on this interface only. If peers are reachable through several interfaces, like on a BGP on the host setup, you need a workaround. A simple one is to use a VXLAN interface:
$ ip -6 link add keepalived6 type vxlan id 6 dstport 4789 local 2001:db8::10 nolearning
$ bridge fdb append 00:00:00:00:00:00 dev keepalived6 dst 2001:db8::11
$ bridge fdb append 00:00:00:00:00:00 dev keepalived6 dst 2001:db8::12
$ ip link set up dev keepalived6
Learning of MAC addresses is disabled and one generic entry for each peer is added in the forwarding database: transmitted packets are broadcasted to all peers, notably VRRP packets. Have a look at VXLAN & Linux for additional details.
vrrp_instance gateway1  
  state BACKUP
  interface keepalived6
  mcast_src_ip 2001:db8::10
  virtual_router_id 12
  priority 101
  virtual_ipaddress  
    2001:db8:ff/64 dev lo
   
 
Starting from Keepalived 2.21, unicast_peer can be used without the interface directive. I think using VXLAN is still a neat trick applicable to other situations where communication using broadcast or multicast is needed, while the underlying network provide no support for this.

Vincent Bernat: Syncing NetBox with a custom Ansible module

The netbox.netbox collection from Ansible Galaxy provides several modules to update NetBox objects:
- name: create a device in NetBox
  netbox_device:
    netbox_url: http://netbox.local
    netbox_token: s3cret
    data:
      name: to3-p14.sfo1.example.com
      device_type: QFX5110-48S
      device_role: Compute Switch
      site: SFO1
However, if NetBox is not your source of truth, you may want to ensure it stays in sync with your configuration management database1 by removing outdated devices or IP addresses. While it should be possible to glue together a playbook with a query, a loop and some filtering to delete unwanted elements, it feels clunky, inefficient and an abuse of YAML as a programming language. A specific Ansible module solves this issue and is likely more flexible.

Notice I recommend that you read Writing a custom Ansible module as an introduction, as well as Syncing MySQL tables for a first simpler example.

Code The module has the following signature and it syncs NetBox with the content of the provided YAML file:
netbox_sync:
  source: netbox.yaml
  api: https://netbox.example.com
  token: s3cret
The synchronized objects are:
  • sites,
  • manufacturers,
  • device types,
  • device roles,
  • devices, and
  • IP addresses.
In our environment, the YAML file is generated from our configuration management database and contains a set of devices and a list of IP addresses:
devices:
  ad2-p6.sfo1.example.com:
     datacenter: sfo1
     manufacturer: Cisco
     model: Catalyst 2960G-48TC-L
     role: net_tor_oob_switch
  to1-p6.sfo1.example.com:
     datacenter: sfo1
     manufacturer: Juniper
     model: QFX5110-48S
     role: net_tor_gpu_switch
# [ ]
ips:
  - device: ad2-p6.example.com
    ip: 172.31.115.18/21
    interface: oob
  - device: to1-p6.example.com
    ip: 172.31.115.33/21
    interface: oob
  - device: to1-p6.example.com
    ip: 172.31.254.33/32
    interface: lo0.0
# [ ]
The network team is not the sole tenant in NetBox. While adding new objects or modifying existing ones should be relatively safe, deleting unwanted objects can be risky. The module only deletes objects it did create or modify. To identify them, it marks them with a specific tag, cmdb. Most objects in NetBox accept tags.

Module definition Starting from the skeleton described in the previous article, we define the module:
module_args = dict(
    source=dict(type='path', required=True),
    api=dict(type='str', required=True),
    token=dict(type='str', required=True, no_log=True),
    max_workers=dict(type='int', required=False, default=10)
)
result = dict(
    changed=False
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)
It contains an additional optional arguments defining the number of workers to talk to NetBox and query the existing objects in parallel to speedup the execution.

Abstracting synchronization We need to synchronize different object types, but once we have a list of objects we want in NetBox, the grunt work is always the same:
  • check if the objects already exist,
  • retrieve them and put them in a form suitable for comparison,
  • retrieve the extra objects we don t want anymore,
  • compare the two sets, and
  • add missing objects, update existing ones, delete extra ones.
We code these behaviours into a Synchronizer abstract class. For each kind of object, a concrete class is built with the appropriate class attributes to tune its behaviour and a wanted() method to provide the objects we want. I am not explaining the abstract class code here. Have a look at the source if you want.

Synchronizing tags and tenants As a starter, here is how we define the class synchronizing the tags:
class SyncTags(Synchronizer):
    app = "extras"
    table = "tags"
    key = "name"
    def wanted(self):
        return  "cmdb": dict(
            slug="cmdb",
            color="8bc34a",
            description="synced by network CMDB") 
The app and table attributes defines the NetBox objects we want to manipulate. The key attribute is used to determine how to lookup for existing objects. In this example, we want to lookup tags using their names. The wanted() method is expected to return a dictionary mapping object keys to the list of wanted attributes. Here, the keys are tag names and we create only one tag, cmdb, with the provided slug, color and description. This is the tag we will use to mark the objects we create or modify. If the tag does not exist, it is created. If it exists, the provided attributes are updated. Other attributes are left untouched. We also want to create a specific tenant for objects accepting such an attribute (devices and IP addresses):
class SyncTenants(Synchronizer):
    app = "tenancy"
    table = "tenants"
    key = "name"
    def wanted(self):
        return  "Network": dict(slug="network",
                                description="Network team") 

Synchronizing sites We also need to synchronize the list of sites. This time, the wanted() method uses the information provided in the YAML file: it walks the devices and builds a set of datacenter names.
class SyncSites(Synchronizer):
    app = "dcim"
    table = "sites"
    key = "name"
    only_on_create = ("status", "slug")
    def wanted(self):
        result = set(details["datacenter"]
                     for details in self.source['devices'].values()
                     if "datacenter" in details)
        return  k: dict(slug=k,
                        status="planned")
                for k in result 
Thanks to the use of the only_on_create attribute, the specified attributes are not updated if they are different. The goal of this synchronizer is mostly to collect the references to the different sites for other objects.
>>> pprint(SyncSites(**sync_args).wanted())
 'sfo1':  'slug': 'sfo1', 'status': 'planned' ,
 'chi1':  'slug': 'chi1', 'status': 'planned' ,
 'nyc1':  'slug': 'nyc1', 'status': 'planned' 

Synchronizing manufacturers, device types and device roles The synchronization of manufacturers is pretty similar, except we do not use the only_on_create attribute:
class SyncManufacturers(Synchronizer):
    app = "dcim"
    table = "manufacturers"
    key = "name"
    def wanted(self):
        result = set(details["manufacturer"]
                     for details in self.source['devices'].values()
                     if "manufacturer" in details)
        return  k:  "slug": slugify(k) 
                for k in result 
Regarding the device types, we use the foreign attribute linking a NetBox attribute to the synchronizer handling it.
class SyncDeviceTypes(Synchronizer):
    app = "dcim"
    table = "device_types"
    key = "model"
    foreign =  "manufacturer": SyncManufacturers 
    def wanted(self):
        result = set((details["manufacturer"], details["model"])
                     for details in self.source['devices'].values()
                     if "model" in details)
        return  k[1]: dict(manufacturer=k[0],
                           slug=slugify(k[1]))
                for k in result 
The wanted() method refers to the manufacturer using its key attribute. In this case, this is the manufacturer name.
>>> pprint(SyncManufacturers(**sync_args).wanted())
 'Cisco':  'slug': 'cisco' ,
 'Dell':  'slug': 'dell' ,
 'Juniper':  'slug': 'juniper' 
>>> pprint(SyncDeviceTypes(**sync_args).wanted())
 'ASR 9001':  'manufacturer': 'Cisco', 'slug': 'asr-9001' ,
 'Catalyst 2960G-48TC-L':  'manufacturer': 'Cisco',
                           'slug': 'catalyst-2960g-48tc-l' ,
 'MX10003':  'manufacturer': 'Juniper', 'slug': 'mx10003' ,
 'QFX10002-36Q':  'manufacturer': 'Juniper', 'slug': 'qfx10002-36q' ,
 'QFX10002-72Q':  'manufacturer': 'Juniper', 'slug': 'qfx10002-72q' ,
 'QFX5110-32Q':  'manufacturer': 'Juniper', 'slug': 'qfx5110-32q' ,
 'QFX5110-48S':  'manufacturer': 'Juniper', 'slug': 'qfx5110-48s' ,
 'QFX5200-32C':  'manufacturer': 'Juniper', 'slug': 'qfx5200-32c' ,
 'S4048-ON':  'manufacturer': 'Dell', 'slug': 's4048-on' ,
 'S6010-ON':  'manufacturer': 'Dell', 'slug': 's6010-on' 
The device roles are defined like this:
class SyncDeviceRoles(Synchronizer):
    app = "dcim"
    table = "device_roles"
    key = "name"
    def wanted(self):
        result = set(details["role"]
                     for details in self.source['devices'].values()
                     if "role" in details)
        return  k: dict(slug=slugify(k),
                        color="8bc34a")
                for k in result 

Synchronizing devices A device is mostly a name with references to a role, a model, a datacenter and a tenant. These references are declared as foreign keys using the synchronizers defined previously.
class SyncDevices(Synchronizer):
    app = "dcim"
    table = "devices"
    key = "name"
    foreign =  "device_role": SyncDeviceRoles,
               "device_type": SyncDeviceTypes,
               "site": SyncSites,
               "tenant": SyncTenants 
    remove_unused = 10
    def wanted(self):
        return  name: dict(device_role=details["role"],
                           device_type=details["model"],
                           site=details["datacenter"],
                           tenant="Network")
                for name, details in self.source['devices'].items()
                if  "datacenter", "model", "role"  <= set(details.keys()) 
The remove_unused attribute is a safety implemented to fail if we have to delete more than 10 devices: this may be the indication there is a bug somewhere, unless one of your datacenter suddenly caught fire.
>>> pprint(SyncDevices(**sync_args).wanted())
 'ad2-p6.sfo1.example.com':  'device_role': 'net_tor_oob_switch',
                             'device_type': 'Catalyst 2960G-48TC-L',
                             'site': 'sfo1',
                             'tenant': 'Network' ,
 'to1-p6.sfo1.example.com':  'device_role': 'net_tor_gpu_switch',
                             'device_type': 'QFX5110-48S',
                             'site': 'sfo1',
                             'tenant': 'Network' ,
[ ]

Synchronizing IP addresses The last step is to synchronize IP addresses. We do not attach them to a device.2 Instead, we specify the device names in the description of the IP address:
class SyncIPs(Synchronizer):
    app = "ipam"
    table = "ip-addresses"
    key = "address"
    foreign =  "tenant": SyncTenants 
    remove_unused = 1000
    def wanted(self):
        wanted =  
        for details in self.source['ips']:
            if details['ip'] in wanted:
                wanted[details['ip']]['description'] = \
                    f" details['device']  (and others)"
            else:
                wanted[details['ip']] = dict(
                    tenant="Network",
                    status="active",
                    dns_name="",        # information is present in DNS
                    description=f" details['device'] :  details['interface'] ",
                    role=None,
                    vrf=None)
        return wanted
There is a slight difficulty: NetBox allows duplicate IP addresses, so a simple lookup is not enough. In case of multiple matches, we choose the best by preferring those tagged with cmdb, then those already attached to an interface:
def get(self, key):
    """Grab IP address from NetBox."""
    # There may be duplicate. We need to grab the "best".
    results = super(Synchronizer, self).get(key)
    if len(results) == 0:
        return None
    if len(results) == 1:
        return results[0]
    scores = [0]*len(results)
    for idx, result in enumerate(results):
        if "cmdb" in result.tags:
            scores[idx] += 10
        if result.interface is not None:
            scores[idx] += 5
    return sorted(zip(scores, results),
                  reverse=True, key=lambda k: k[0])[0][1]

Getting the current and wanted states Each synchronizer is initialized with a reference to the Ansible module, a reference to a pynetbox s API object, the data contained in the provided YAML file and two empty dictionaries for the current and expected states:
source = yaml.safe_load(open(module.params['source']))
netbox = pynetbox.api(module.params['api'],
                      token=module.params['token'])
sync_args = dict(
    module=module,
    netbox=netbox,
    source=source,
    before= ,
    after= 
)
synchronizers = [synchronizer(**sync_args) for synchronizer in [
    SyncTags,
    SyncTenants,
    SyncSites,
    SyncManufacturers,
    SyncDeviceTypes,
    SyncDeviceRoles,
    SyncDevices,
    SyncIPs
]]
Each synchronizer has a prepare() method whose goal is to compute the current and wanted states. It returns True in case of a difference:
# Check what needs to be synchronized
try:
    for synchronizer in synchronizers:
        result['changed']  = synchronizer.prepare()
except AnsibleError as e:
    result['msg'] = e.message
    module.fail_json(**result)

Applying changes Back to the skeleton described in the previous article, the last step is to apply the changes if there is a difference between these states. Each synchronizer registers the current and wanted states in sync_args["before"][table] and sync_args["after"][table] where table is the name of the table for a given NetBox object type. The diff object is a bit elaborate as it is built table by table. This enables Ansible to display the name of each table before the diff representation:
# Compute the diff
if module._diff and result['changed']:
    result['diff'] = [
        dict(
            before_header=table,
            after_header=table,
            before=yaml.safe_dump(sync_args["before"][table]),
            after=yaml.safe_dump(sync_args["after"][table]))
        for table in sync_args["after"]
        if sync_args["before"][table] != sync_args["after"][table]
    ]
# Stop here if check mode is enabled or if no change
if module.check_mode or not result['changed']:
    module.exit_json(**result)
Each synchronizer also exposes a synchronize() method to apply changes and a cleanup() method to delete unwanted objects. Order is important due to the relation between the objects.
# Synchronize
for synchronizer in synchronizers:
    synchronizer.synchronize()
for synchronizer in synchronizers[::-1]:
    synchronizer.cleanup()
module.exit_json(**result)

The complete code is available on GitHub. Compared to using netbox.netbox collection, the logic is written in Python instead of trying to glue Ansible tasks together. I believe this is both more flexible and easier to read, notably when trying to delete outdated objects. While I did not test it, it should also be faster. An alternative would have been to reuse code from the netbox.netbox collection, as it contains similar primitives. Unfortunately, I didn t think of it until now.

  1. In my opinion, a good option for a source of truth is to use YAML files in a Git repository. You get versioning for free and people can get started with a text editor.
  2. This limitation is mostly due to laziness: we do not really care about this information. Our main motivation for putting IP addresses in NetBox is to keep track of the used IP addresses. However, if an IP address is already attached to an interface, we leave this association untouched.

2 September 2020

Vincent Bernat: Syncing MySQL tables with a custom Ansible module

The community.mysql collection from Ansible Galaxy provides a mysql_query module to run arbitrary MySQL queries. Unfortunately, it does not support check mode nor the --diff flag. It is also unable to tell if there was a change. Let s write a specific Ansible module to workaround these issues.

Notice I recommend that you read Writing a custom Ansible module as an introduction.

Code The module has the following signature and it executes the provided SQL statements in a single transaction. It needs a list of the affected tables to be able to detect and show the changes.
mysql_sync:
  sql:  
    DELETE FROM rules WHERE name LIKE 'CMDB:%';
    INSERT INTO rules (name, rule) VALUES
      ('CMDB: check for cats', ':is(object, "CAT")'),
      ('CMDB: check for dogs', ':is(object, "DOG")');
    REPLACE INTO webhooks (name, url) VALUES
      ('OpsGenie', 'https://opsgenie/something/token'),
      ('Slack', 'https://slack/something/token');
  user: monitoring
  password: Yooghah5
  database: monitoring
  tables:
    - rules
    - webhooks

Prerequisites The module does not enforce idempotency, but it is expected you provide appropriate SQL queries. In the above example, idempotency is achieved because the content of the rules table is deleted and recreated from scratch while the rows in the webhooks table are replaced if they already exist. You need the PyMySQL package.

Module definition Starting from the skeleton described in the previous article, here is the module definition:
module_args = dict(
    sql=dict(type='str', required=True),
    user=dict(type='str', required=True),
    password=dict(type='str', required=True, no_log=True),
    database=dict(type='str', required=True),
    tables=dict(type='list', required=True, elements='str'),
)
result = dict(
    changed=False
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)
The password is marked with no_log to ensure it won t be displayed or stored, notably when ansible-playbook runs in verbose mode. There is no host option as the module is executed on the MySQL host. Strong authentication using certificates is not implemented either. This matches our goal with custom modules: only implement what you strictly need.

Getting the current rows The next step is to retrieve the records currently in the database. The got dictionary is a mapping from table names to the list of rows they contain:
got =  
tables = module.params['tables']
connection = pymysql.connect(
    user=module.params['user'],
    password=module.params['password'],
    db=module.params['database'],
    charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor
)
with connection.cursor() as cursor:
    for table in tables:
        cursor.execute("SELECT * FROM  ".format(table))
        got[table] = cursor.fetchall()

Computing the changes Let s now build the wanted dictionary. The trick is to execute the SQL statements in a transaction without issuing a final commit. The changes will be invisible1 to other readers and we can compare the final rows with the rows collected in got:
wanted =  
sql = module.params['sql']
statements = [statement.strip()
              for statement in sql.split(";\n")
              if statement.strip()]
with connection.cursor() as cursor:
    for statement in statements:
        try:
            cursor.execute(statement)
        except pymysql.OperationalError as err:
            code, message = err.args
            result['msg'] = "MySQL error for  :  ".format(
                statement,
                message)
            module.fail_json(**result)
    for table in tables:
        cursor.execute("SELECT * FROM  ".format(table))
        wanted[table] = cursor.fetchall()
The first for loop executes each statement. On error, we return a helpful message containing the faulty one. The second for loop records the final rows of each table in wanted.

Applying changes Back to the skeleton described in the previous article, the last step is to apply the changes if there is a difference between got and wanted when not running with check mode. The diff object is a bit more elaborate as it is built table by table. This enables Ansible to display the name of each table before the diff representation:
if got != wanted:
    result['changed'] = True
    result['diff'] = [dict(
        before_header=table,
        after_header=table,
        before=yaml.safe_dump(got[table]),
        after=yaml.safe_dump(wanted[table]))
                      for table in tables
                      if got[table] != wanted[table]]
if module.check_mode or not result['changed']:
    module.exit_json(**result)
Applying the changes is quite trivial: just commit them! Otherwise, they are lost when the module exits.
connection.commit()

The complete code is available on GitHub. Compared to the mysql_query module, this one supports the check mode, signals correctly if there is a change and displays the differences. However, it should not be used with huge tables, as it would try to load them in memory.

  1. The tables need to use the InnoDB storage engine. Moreover, MySQL does not know how to use transactions with DDL statements: do not modify table definitions!

Vincent Bernat: Syncing SSH keys on Cisco IOS-XR with a custom Ansible module

The cisco.iosxr collection from Ansible Galaxy provides an iosxr_user module to manage local users, along with their SSH keys. However, the module is quite slow, do not display a diff for changed SSH keys, never signal change when a key is modified, and does not delete obsolete keys. Let s write a custom Ansible module managing only the SSH keys while fixing these issues.

Notice I recommend that you read Writing a custom Ansible module as an introduction.

How to add an SSH key to a user Adding SSH keys to users in Cisco IOS-XR is quite undocumented. First, you need to encode the key with the ssh-rsa key ASN.1 format, like an OpenSSH public key, but without the base64-encoding:
$ awk ' print $2 ' id_rsa.pub \
      base64 -d \
    > publickey_vincent.raw
Then, you upload the key with SCP to harddisk:/publickey_vincent.raw and import it for the current user with the following IOS command:
crypto key import authentication rsa harddisk:/publickey_vincent.b64
However, if you want to import a key for another user, you need to be part of the root-system group:
username vincent
 group root-lr
 group root-system
With the following admin command, you can attach a key to another user:
admin crypto key import authentication rsa username cedric harddisk:/publickey_cedric.b64

Code The module has the following signature and it installs the specified key for each user and remove keys from retired users the ones we do not specify.
iosxr_users:
  keys:
    vincent: ssh-rsa AAAAB3NzaC1yc2EAA[ ]ymh+YrVWLZMJR
    cedric:  ssh-rsa AAAAB3NzaC1yc2EAA[ ]RShPA8w/8eC0n

Prerequisites Unlike the iosxr_user module, our custom module only handles SSH keys, one per user. Therefore, the user definitions have to already exist in the running configuration.1 Moreover, the user defined in ansible_user needs to be in the root-system group. The cisco.iosxr collection must also be installed as the module relies on its code. When running the module, ansible_connection needs to be set to network_cli and ansible_network_os to iosxr. These variables are usually defined in the inventory.

Module definition Starting from the skeleton described in the previous article, we define the module:
module_args = dict(
    keys=dict(type='dict', elements='str', required=True),
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)
result = dict(
    changed=False
)

Getting the installed keys The next step is to retrieve the keys currently installed. This can be done with the following command:
# show crypto key authentication rsa all
Key label: vincent
Type     : RSA public key authentication
Size     : 2048
Imported : 16:17:08 UTC Tue Aug 11 2020
Data     :
 30820122 300D0609 2A864886 F70D0101 01050003 82010F00 3082010A 02820101
 00D81E5B A73D82F3 77B1E4B5 949FB245 60FB9167 7CD03AB7 ADDE7AFE A0B83174
 A33EC0E6 1C887E02 2338367A 8A1DB0CE 0C3FBC51 15723AEB 07F301A4 B1A9961A
 2D00DBBD 2ABFC831 B0B25932 05B3BC30 B9514EA1 3DC22CBD DDCA6F02 026DBBB6
 EE3CFADA AFA86F52 CAE7620D 17C3582B 4422D24F D68698A5 52ED1E9E 8E41F062
 7DE81015 F33AD486 C14D0BB1 68C65259 F9FD8A37 8DE52ED0 7B36E005 8C58516B
 7EA6C29A EEE0833B 42714618 50B3FFAC 15DBE3EF 8DA5D337 68DAECB9 904DE520
 2D627CEA 67E6434F E974CF6D 952AB2AB F074FBA3 3FB9B9CC A0CD0ADC 6E0CDB2A
 6A1CFEBA E97AF5A9 1FE41F6C 92E1F522 673E1A5F 69C68E11 4A13C0F3 0FFC782D
 27020301 0001
[ ]
ansible_collections.cisco.iosxr.plugins.module_utils.network.iosxr.iosxr contains a run_commands() function we can use:
command = "show crypto key authentication rsa all"
out = run_commands(module, command)
out = out[0].replace(' \n', '\n')
A common library to parse a command output is textfsm: a Python module using a template-based state machine for parsing semi-formatted text.
template = r"""
Value Required Label (\w+)
Value Required,List Data ([A-F0-9 ]+)
Start
 ^Key label: $ Label 
 ^Data\s+: -> GetData
GetData
 ^ $ Data 
 ^$$ -> Record Start
""".lstrip()
re_table = textfsm.TextFSM(io.StringIO(template))
got =  data[0]: "".join(data[1]).replace(' ', '')
       for data in re_table.ParseText(out) 
got is a dictionary associating key labels, considered as usernames, with a hexadecimal representation of the public key currently installed. It looks like this:
>>> pprint(got)
 'alfred': '30820122300D0609[ ]6F0203010001',
 'cedric': '30820122300D0609[ ]710203010001',
 'vincent': '30820122300D0609[ ]270203010001' 

Comparing with the wanted keys Let s now build the wanted dictionary using the same structure. In module.params['keys'], we have a dictionary associating usernames to public SSH keys in the OpenSSH format:
>>> pprint(module.params['keys'])
 'cedric': 'ssh-rsa AAAAB3NzaC1yc2[ ]',
 'vincent': 'ssh-rsa AAAAB3NzaC1yc2[ ]' 
We need to convert these keys in the same hexadecimal representation used by Cisco above. The ssh-keygen command and some glue can do the conversion:2
$ ssh-keygen -f id_rsa.pub -e -mPKCS8 \
     grep -v '^---' \
     base64 -d \
     hexdump -e '4/1 "%0.2X"'
30820122300D06092[ ]782D270203010001
Assuming we have a ssh2cisco() function doing that, we can build the wanted dictionary:
wanted =  k: ssh2cisco(v)
          for k, v in module.params['keys'].items() 

Applying changes Back to the skeleton described in the previous article, the last step is to apply the changes if there is a difference between got and wanted when not running with check mode. The part comparing got and wanted is taken verbatim from the skeleton module:
if got != wanted:
    result['changed'] = True
    result['diff'] = dict(
        before=yaml.safe_dump(got),
        after=yaml.safe_dump(wanted)
    )
if module.check_mode or not result['changed']:
    module.exit_json(**result)
Let s copy the new or changed keys and attach them to their respective users. For this purpose, we reuse the get_connection() and copy_file() functions from ansible_collections.cisco.iosxr.plugins.module_utils.network.iosxr.iosxr.
conn = get_connection(module)
for user in wanted:
    if user not in got or wanted[user] != got[user]:
        dst = f"/harddisk:/publickey_ user .raw"
        with tempfile.NamedTemporaryFile() as src:
            decoded = base64.b64decode(
                module.params['keys'][user].split()[1])
            src.write(decoded)
            src.flush()
            copy_file(module, src.name, dst)
    command = ("admin crypto key import authentication rsa "
               f"username  user   dst ")
    conn.send_command(command, prompt="yes/no", answer="yes")
Then, we remove obsolete keys:
for user in got:
    if user not in wanted:
        command = ("admin crypto key zeroize authentication rsa "
                   f"username  user ")
        conn.send_command(command, prompt="yes/no", answer="yes")

The complete code is available on GitHub. Compared to the iosxr_user module, this one displays a diff when running with --diff, correctly signals a change, is faster, 3 and deletes unwanted SSH keys. However, it is unable to create users and cannot configure passwords or multiple SSH keys.

  1. In our environment, the Ansible playbook pushes a full configuration, including the user definitions. Then, it synchronizes the SSH keys.
  2. Despite the argument provided to ssh-keygen, the format used by Cisco is not PKCS#8. This is the ASN.1 representation of a Subject Public Key Info structure, as defined in RFC 2459. Moreover, PKCS#8 is a format for a private key, not a public one.
  3. The main factors for being faster are:
    • not creating users, and
    • not reuploading existing SSH keys.

Vincent Bernat: Writing a custom Ansible module

Ansible ships a lot of modules you can combine for your configuration management needs. However, the quality of these modules may vary widely. Sometimes, it may be quicker and more robust to write your own module instead of shopping and assembling existing ones.1 In my opinion, a robust module exhibits the following characteristics: In a nutshell, it means the module can run with --diff --check and shows the changes it would apply. When run twice in a row, the second run won t apply or signal changes. The last bullet point suggests the module should be able to delete outdated objects configured during previous runs.2 The module code should be minimal and tailored to your needs. Making the module generic for use by other users is a non-goal. Less code usually means less bugs and easier to understand. I do not cover testing here. It is undeniably a good practice, but it requires a significant effort. In my opinion, it is preferable to have a well written module matching the above characteristics rather than a module that is well tested but without them or a module requiring further (untested) assembly to meet your needs.

Module skeleton Ansible documentation contains instructions to build a module, along with some best practices. As one of our non-goal is to distribute it, we choose to take some shortcuts and skip some of the boilerplate. Let s assume we build a module with the following signature:
custom_module:
  user: someone
  password: something
  data: "some random string"
There are various locations you can put a module in Ansible. A common possibility is to include it into a role. In a library/ subdirectory, create an empty __init__.py file and a custom_module.py file with the following code:3
#!/usr/bin/python
import yaml
from ansible.module_utils.basic import AnsibleModule
def main():
    # Define options accepted by the module.  
    module_args = dict(
        user=dict(type='str', required=True),
        password=dict(type='str', required=True, no_log=True),
        data=dict(type='str', required=True),
    )
    module = AnsibleModule(
        argument_spec=module_args,
        supports_check_mode=True
    )
    result = dict(
        changed=False
    )
    got =  
    wanted =  
    # Populate both  got  and  wanted .  
    # [...]
    if got != wanted:
        result['changed'] = True
        result['diff'] = dict(
            before=yaml.safe_dump(got),
            after=yaml.safe_dump(wanted)
        )
    if module.check_mode or not result['changed']:
        module.exit_json(**result)
    # Apply changes.  
    # [...]
    module.exit_json(**result)
if __name__ == '__main__':
    main()
The first part, in , defines the module, with the accepted options. Refer to the documentation on argument_spec for more details. The second part, in , builds the got and wanted variables. got is the current state while wanted is the target state. For example, if you need to modify records in a database server, got would be the current rows while wanted would be the modified rows. Then, we compare got and wanted. If there is a difference, changed is switched to True and we prepare the diff object. Ansible uses it to display the differences between the states. If we are running in check mode or if no change is detected, we stop here. The last part, in , applies the changes. Usually, it means iterating over the two structures to detect the differences and create the missing items, delete the unwanted ones and update the existing ones.

Documentation Ansible provides a fairly complete page on how to document a module. I advise you to take a more minimal approach by only documenting each option sparingly,4 skipping the examples and only documenting return values if it needs to. I usually limit myself to something like this:
DOCUMENTATION = """
---
module: custom_module.py
short_description: Pass provided data to remote service
description:
  - Mention anything useful for your workmate.
  - Also mention anything you want to remember in 6 months.
options:
  user:
    description:
      - user to identify to remote service
  password:
    description:
      - password for authentication to remote service
  data:
    description:
      - data to send to remote service
"""

Error handling If you run into an error, you can stop the execution with module.fail_json():
module.fail_json(
    msg=f"remote service answered with  code :  message ",
    **result
)
There is no requirement to intercept all errors. Sometimes, not swallowing an exception provides better information than replacing it with a generic message.

Returning additional values A module may return additional information that can be captured to be used in another task through the register directive. For this purpose, you can add arbitrary fields to the result dictionary. Have a look at the documentation for common return values. You should try to add these fields before exiting the module when in check mode. The returned values can be documented.

Examples Here are several examples of custom modules following the previous skeleton. Each example highlight why a custom module was written instead of assembling existing modules.

  1. Also, when using modules from Ansible Galaxy, you introduce a dependency to a third-party. This is not something that should be decided lightly: it may break later, it may only meet 80% of the needs, it may add bugs.
  2. Some declarative systems, like Terraform, exhibits all these behaviors.
  3. Do not worry about the shebang. It is hardcoded to /usr/bin/python. Ansible will modify it to match the chosen interpreter on the remote host. You can write Python 3 code if ansible_python_interpreter evaluates to a Python 3 interpreter.
  4. The main issue I have with this non-programmatic approach to documentation is that it partly repeats the information contained in argument_spec. I think an auto-documenting structure would avoid this.

23 August 2020

Vincent Bernat: Zero-Touch Provisioning for Cisco IOS

The official documentation to automatically upgrade and configure on first boot a Cisco switch running on IOS, like a Cisco Catalyst 2960-X Series switch, is scarce on details. This note explains how to configure the ISC DHCP Server for this purpose.
When booting for the first time, Cisco IOS sends a DHCP request on all ports:
Dynamic Host Configuration Protocol (Discover)
    Message type: Boot Request (1)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x0000117c
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 0.0.0.0
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: Cisco_6c:12:c0 (b4:14:89:6c:12:c0)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (53) DHCP Message Type (Discover)
    Option: (57) Maximum DHCP Message Size
    Option: (61) Client identifier
        Length: 25
        Type: 0
        Client Identifier: cisco-b414.896c.12c0-Vl1
    Option: (55) Parameter Request List
        Length: 12
        Parameter Request List Item: (1) Subnet Mask
        Parameter Request List Item: (66) TFTP Server Name
        Parameter Request List Item: (6) Domain Name Server
        Parameter Request List Item: (15) Domain Name
        Parameter Request List Item: (44) NetBIOS over TCP/IP Name Server
        Parameter Request List Item: (3) Router
        Parameter Request List Item: (67) Bootfile name
        Parameter Request List Item: (12) Host Name
        Parameter Request List Item: (33) Static Route
        Parameter Request List Item: (150) TFTP Server Address
        Parameter Request List Item: (43) Vendor-Specific Information
        Parameter Request List Item: (125) V-I Vendor-specific Information
    Option: (255) End
It requests a number of options, including the Bootfile name option 67, the TFTP server address option 150 and the Vendor-Identifying Vendor-Specific Information Option 125 or VIVSO. Option 67 provides the name of the configuration file located on the TFTP server identified by option 150. Option 125 includes the name of the file describing the Cisco IOS image to use to upgrade the switch. This file only contains the name of the tarball embedding the image.1 Configuring the ISC DHCP Server to answer with the TFTP server address and the name of the configuration file is simple enough:
filename "ob2-p2.example.com";
option tftp-server-address 172.16.15.253;
However, if you want to also provide the image for upgrade, you have to specify a hexadecimal-encoded string:2
option vivso 00:00:00:09:24:05:22:63:32:39:36:30:2d:6c:61:6e:62:61:73:65:6b:39:2d:74:61:72:2e:31:35:30:2d:32:2e:53:45:31:31:2e:74:78:74;
Having a large hexadecimal-encoded string inside a configuration file is quite unsatisfying. Instead, the ISC DHCP Server allows you to express this information in a more readable way using the option space statement:
# Create option space for Cisco and encapsulate it in VIVSO/vendor space
option space cisco code width 1 length width 1;
option cisco.auto-update-image code 5 = text;
option vendor.cisco code 9 = encapsulate cisco;
# Image description for Cisco IOS ZTP
option cisco.auto-update-image = "c2960-lanbasek9-tar.150-2.SE11.txt";
# Workaround for VIVSO option 125 not being sent
option vendor.iana code 0 = string;
option vendor.iana = 01:01:01;
Without the workaround mentioned in the last block, the ISC DHCP Server would not send back option 125. With such a configuration, it returns the following answer, including a harmless additional enterprise 0 encapsulated into option 125:
Dynamic Host Configuration Protocol (Offer)
    Message type: Boot Reply (2)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x0000117c
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 172.16.15.6
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: Cisco_6c:12:c0 (b4:14:89:6c:12:c0)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name: ob2-p2.example.com
    Magic cookie: DHCP
    Option: (53) DHCP Message Type (Offer)
    Option: (54) DHCP Server Identifier (172.16.15.252)
    Option: (51) IP Address Lease Time
    Option: (1) Subnet Mask (255.255.248.0)
    Option: (6) Domain Name Server
    Option: (3) Router
    Option: (150) TFTP Server Address
        Length: 4
        TFTP Server Address: 172.16.15.252
    Option: (125) V-I Vendor-specific Information
        Length: 49
        Enterprise: Reserved (0)
        Enterprise: ciscoSystems (9)
            Length: 36
            Option 125 Suboption: 5
                Length: 34
                Data: 63323936302d6c616e626173656b392d7461722e3135302d 
    Option: (255) End

  1. The reason of this indirection is still puzzling me. I suppose it could be because updating the image name directly in option 125 is quite a hassle.
  2. It contains the following information:
    • 0x00000009: Cisco s Enterprise Number,
    • 0x24: length of the enclosed data,
    • 0x05: Cisco s auto-update sub-option,
    • 0x22: length of the sub-option data, and
    • filename of the image description (c2960-lanbasek9-tar.150-2.SE11.txt).

5 April 2020

Vincent Bernat: Safer SSH agent forwarding

ssh-agent is a program to hold in memory the private keys used by SSH for public-key authentication. When the agent is running, ssh forwards to it the signature requests from the server. The agent performs the private key operations and returns the results to ssh. It is useful if you keep your private keys encrypted on disk and you don t want to type the password at each connection. Keeping the agent secure is critical: someone able to communicate with the agent can authenticate on your behalf on remote servers. ssh also provides the ability to forward the agent to a remote server. From this remote server, you can authenticate to another server using your local agent, without copying your private key on the intermediate server. As stated in the manual page, this is dangerous!
Agent forwarding should be enabled with caution. Users with the ability to bypass file permissions on the remote host (for the agent s UNIX-domain socket) can access the local agent through the forwarded connection. An attacker cannot obtain key material from the agent, however they can perform operations on the keys that enable them to authenticate using the identities loaded into the agent. A safer alternative may be to use a jump host (see -J).
As mentioned, a better alternative is to use the jump host feature: the SSH connection to the target host is tunneled through the SSH connection to the jump host. See the manual page and this blog post for more details.
If you really need to use SSH agent forwarding, you can secure it a bit through a dedicated agent with two main attributes: The following alias around the ssh command will spawn such an ephemeral agent:
alias assh="ssh-agent ssh -o AddKeysToAgent=confirm -o ForwardAgent=yes"
With the -o AddKeysToAgent=confirm directive, ssh adds the unencrypted private key to the agent but each use must be confirmed.1 Once connected, you get a password prompt for each signature request:2
ssh-agent prompt confirmation with fingerprint and yes/no buttons
Request for the agent to use the specified private key
But, again, avoid using agent forwarding!

Update (2020-04) In a previous version of this article, the wrapper around the ssh command was a more complex function. Alexandre Oliva was kind enough to point me to the simpler solution above.

Update (2020-04) Guardian Agent is an even safer alternative: it shows and ensures the usage (target and command) of the requested signature. There is also a wide range of alternative solutions to this problem. See for example SSH-Ident, Wikimedia solution and solo-agent.


  1. Alternatively, you can add the keys with ssh-add -c.
  2. Unfortunately, the dialog box default answer is Yes.

13 September 2017

Vincent Bernat: Route-based IPsec VPN on Linux with strongSwan

A common way to establish an IPsec tunnel on Linux is to use an IKE daemon, like the one from the strongSwan project, with a minimal configuration1:
conn V2-1
  left        = 2001:db8:1::1
  leftsubnet  = 2001:db8:a1::/64
  right       = 2001:db8:2::1
  rightsubnet = 2001:db8:a2::/64
  authby      = psk
  auto        = route
The same configuration can be used on both sides. Each side will figure out if it is left or right . The IPsec site-to-site tunnel endpoints are 2001:db8: 1::1 and 2001:db8: 2::1. The protected subnets are 2001:db8: a1::/64 and 2001:db8: a2::/64. As a result, strongSwan configures the following policies in the kernel:
$ ip xfrm policy
src 2001:db8:a1::/64 dst 2001:db8:a2::/64
        dir out priority 399999 ptype main
        tmpl src 2001:db8:1::1 dst 2001:db8:2::1
                proto esp reqid 4 mode tunnel
src 2001:db8:a2::/64 dst 2001:db8:a1::/64
        dir fwd priority 399999 ptype main
        tmpl src 2001:db8:2::1 dst 2001:db8:1::1
                proto esp reqid 4 mode tunnel
src 2001:db8:a2::/64 dst 2001:db8:a1::/64
        dir in priority 399999 ptype main
        tmpl src 2001:db8:2::1 dst 2001:db8:1::1
                proto esp reqid 4 mode tunnel
[ ]
This kind of IPsec tunnel is a policy-based VPN: encapsulation and decapsulation are governed by these policies. Each of them contains the following elements: When a matching policy is found, the kernel will look for a corresponding security association (using reqid and the endpoint source and destination addresses):
$ ip xfrm state
src 2001:db8:1::1 dst 2001:db8:2::1
        proto esp spi 0xc1890b6e reqid 4 mode tunnel
        replay-window 0 flag af-unspec
        auth-trunc hmac(sha256) 0x5b68[ ]8ba2904 128
        enc cbc(aes) 0x8e0e377ad8fd91e8553648340ff0fa06
        anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
[ ]
If no security association is found, the packet is put on hold and the IKE daemon is asked to negotiate an appropriate one. Otherwise, the packet is encapsulated. The receiving end identifies the appropriate security association using the SPI in the header. Two security associations are needed to establish a bidirectionnal tunnel:
$ tcpdump -pni eth0 -c2 -s0 esp
13:07:30.871150 IP6 2001:db8:1::1 > 2001:db8:2::1: ESP(spi=0xc1890b6e,seq=0x222)
13:07:30.872297 IP6 2001:db8:2::1 > 2001:db8:1::1: ESP(spi=0xcf2426b6,seq=0x204)
All IPsec implementations are compatible with policy-based VPNs. However, some configurations are difficult to implement. For example, consider the following proposition for redundant site-to-site VPNs: Redundant VPNs between 3 sites A possible configuration between V1-1 and V2-1 could be:
conn V1-1-to-V2-1
  left        = 2001:db8:1::1
  leftsubnet  = 2001:db8:a1::/64,2001:db8:a6::cc:1/128,2001:db8:a6::cc:5/128
  right       = 2001:db8:2::1
  rightsubnet = 2001:db8:a2::/64,2001:db8:a6::/64,2001:db8:a8::/64
  authby      = psk
  keyexchange = ikev2
  auto        = route
Each time a subnet is modified on one site, the configurations need to be updated on all sites. Moreover, overlapping subnets (2001:db8: a6::/64 on one side and 2001:db8: a6::cc:1/128 at the other) can also be problematic. The alternative is to use route-based VPNs: any packet traversing a pseudo-interface will be encapsulated using a security policy bound to the interface. This brings two features:
  1. Routing daemons can be used to distribute routes to be protected by the VPN. This decreases the administrative burden when many subnets are present on each side.
  2. Encapsulation and decapsulation can be executed in a different routing instance or namespace. This enables a clean separation between a private routing instance (where VPN users are) and a public routing instance (where VPN endpoints are).

Route-based VPN on Juniper Before looking at how to achieve that on Linux, let s have a look at the way it works with a JunOS-based platform (like a Juniper vSRX). This platform as long-standing history of supporting route-based VPNs (a feature already present in the Netscreen ISG platform). Let s assume we want to configure the IPsec VPN from V3-2 to V1-1. First, we need to configure the tunnel interface and bind it to the private routing instance containing only internal routes (with IPv4, they would have been RFC 1918 routes):
interfaces  
    st0  
        unit 1  
            family inet6  
                address 2001:db8:ff::7/127;
             
         
     
 
routing-instances  
    private  
        instance-type virtual-router;
        interface st0.1;
     
 
The second step is to configure the VPN:
security  
    /* Phase 1 configuration */
    ike  
        proposal IKE-P1  
            authentication-method pre-shared-keys;
            dh-group group20;
            encryption-algorithm aes-256-gcm;
         
        policy IKE-V1-1  
            mode main;
            proposals IKE-P1;
            pre-shared-key ascii-text "d8bdRxaY22oH1j89Z2nATeYyrXfP9ga6xC5mi0RG1uc";
         
        gateway GW-V1-1  
            ike-policy IKE-V1-1;
            address 2001:db8:1::1;
            external-interface lo0.1;
            general-ikeid;
            version v2-only;
         
     
    /* Phase 2 configuration */
    ipsec  
        proposal ESP-P2  
            protocol esp;
            encryption-algorithm aes-256-gcm;
         
        policy IPSEC-V1-1  
            perfect-forward-secrecy keys group20;
            proposals ESP-P2;
         
        vpn VPN-V1-1  
            bind-interface st0.1;
            df-bit copy;
            ike  
                gateway GW-V1-1;
                ipsec-policy IPSEC-V1-1;
             
            establish-tunnels on-traffic;
         
     
 
We get a route-based VPN because we bind the st0.1 interface to the VPN-V1-1 VPN. Once the VPN is up, any packet entering st0.1 will be encapsulated and sent to the 2001:db8: 1::1 endpoint. The last step is to configure BGP in the private routing instance to exchange routes with the remote site:
routing-instances  
    private  
        routing-options  
            router-id 1.0.3.2;
            maximum-paths 16;
         
        protocols  
            bgp  
                preference 140;
                log-updown;
                group v4-VPN  
                    type external;
                    local-as 65003;
                    hold-time 6;
                    neighbor 2001:db8:ff::6 peer-as 65001;
                    multipath;
                    export [ NEXT-HOP-SELF OUR-ROUTES NOTHING ];
                 
             
         
     
 
The export filter OUR-ROUTES needs to select the routes to be advertised to the other peers. For example:
policy-options  
    policy-statement OUR-ROUTES  
        term 10  
            from  
                protocol ospf3;
                route-type internal;
             
            then  
                metric 0;
                accept;
             
         
     
 
The configuration needs to be repeated for the other peers. The complete version is available on GitHub. Once the BGP sessions are up, we start learning routes from the other sites. For example, here is the route for 2001:db8: a1::/64:
> show route 2001:db8:a1::/64 protocol bgp table private.inet6.0 best-path
private.inet6.0: 15 destinations, 19 routes (15 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
2001:db8:a1::/64   *[BGP/140] 01:12:32, localpref 100, from 2001:db8:ff::6
                      AS path: 65001 I, validation-state: unverified
                      to 2001:db8:ff::6 via st0.1
                    > to 2001:db8:ff::14 via st0.2
It was learnt both from V1-1 (through st0.1) and V1-2 (through st0.2). The route is part of the private routing instance but encapsulated packets are sent/received in the public routing instance. No route-leaking is needed for this configuration. The VPN cannot be used as a gateway from internal hosts to external hosts (or vice-versa). This could also have been done with JunOS security policies (stateful firewall rules) but doing the separation with routing instances also ensure routes from different domains are not mixed and a simple policy misconfiguration won t lead to a disaster.

Route-based VPN on Linux Starting from Linux 3.15, a similar configuration is possible with the help of a virtual tunnel interface3. First, we create the private namespace:
# ip netns add private
# ip netns exec private sysctl -qw net.ipv6.conf.all.forwarding=1
Any private interface needs to be moved to this namespace (no IP is configured as we can use IPv6 link-local addresses):
# ip link set netns private dev eth1
# ip link set netns private dev eth2
# ip netns exec private ip link set up dev eth1
# ip netns exec private ip link set up dev eth2
Then, we create vti6, a tunnel interface (similar to st0.1 in the JunOS example):
# ip tunnel add vti6 \
   mode vti6 \
   local 2001:db8:1::1 \
   remote 2001:db8:3::2 \
   key 6
# ip link set netns private dev vti6
# ip netns exec private ip addr add 2001:db8:ff::6/127 dev vti6
# ip netns exec private sysctl -qw net.ipv4.conf.vti6.disable_policy=1
# ip netns exec private sysctl -qw net.ipv4.conf.vti6.disable_xfrm=1
# ip netns exec private ip link set vti6 mtu 1500
# ip netns exec private ip link set vti6 up
The tunnel interface is created in the initial namespace and moved to the private one. It will remember its original namespace where it will process encapsulated packets. Any packet entering the interface will temporarily get a firewall mark of 6 that will be used only to match the appropriate IPsec policy4 below. The kernel sets a low MTU on the interface to handle any possible combination of ciphers and protocols. We set it to 1500 and let PMTUD do its work. We can then configure strongSwan5:
conn V3-2
  left        = 2001:db8:1::1
  leftsubnet  = ::/0
  right       = 2001:db8:3::2
  rightsubnet = ::/0
  authby      = psk
  mark        = 6
  auto        = route
  keyexchange = ikev2
  keyingtries = %forever
  ike         = aes256gcm16-prfsha384-ecp384!
  esp         = aes256gcm16-prfsha384-ecp384!
  mobike      = no
The IKE daemon configures the following policies in the kernel:
$ ip xfrm policy
src ::/0 dst ::/0
        dir out priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:1::1 dst 2001:db8:3::2
                proto esp reqid 1 mode tunnel
src ::/0 dst ::/0
        dir fwd priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:3::2 dst 2001:db8:1::1
                proto esp reqid 1 mode tunnel
src ::/0 dst ::/0
        dir in priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:3::2 dst 2001:db8:1::1
                proto esp reqid 1 mode tunnel
[ ]
Those policies are used for any source or destination as long as the firewall mark is equal to 6, which matches the mark configured for the tunnel interface. The last step is to configure BGP to exchange routes. We can use BIRD for this:
router id 1.0.1.1;
protocol device  
   scan time 10;
 
protocol kernel  
   persist;
   learn;
   import all;
   export all;
   merge paths yes;
 
protocol bgp IBGP_V3_2  
   local 2001:db8:ff::6 as 65001;
   neighbor 2001:db8:ff::7 as 65003;
   import all;
   export where ifname ~ "eth*";
   preference 160;
   hold time 6;
 
Once BIRD is started in the private namespace, we can check routes are learned correctly:
$ ip netns exec private ip -6 route show 2001:db8:a3::/64
2001:db8:a3::/64 proto bird metric 1024
        nexthop via 2001:db8:ff::5  dev vti5 weight 1
        nexthop via 2001:db8:ff::7  dev vti6 weight 1
The above route was learnt from both V3-1 (through vti5) and V3-2 (through vti6). Like for the JunOS version, there is no route-leaking between the private namespace and the initial one. The VPN cannot be used as a gateway between the two namespaces, only for encapsulation. This also prevent a misconfiguration (for example, IKE daemon not running) from allowing packets to leave the private network. As a bonus, unencrypted traffic can be observed with tcpdump on the tunnel interface:
$ ip netns exec private tcpdump -pni vti6 icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vti6, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
20:51:15.258708 IP6 2001:db8:a1::1 > 2001:db8:a3::1: ICMP6, echo request, seq 69
20:51:15.260874 IP6 2001:db8:a3::1 > 2001:db8:a1::1: ICMP6, echo reply, seq 69
You can find all the configuration files for this example on GitHub. The documentation of strongSwan also features a page about route-based VPNs.

  1. Everything in this post should work with Libreswan.
  2. fwd is for incoming packets on non-local addresses. It only makes sense in transport mode and is a Linux-only particularity.
  3. Virtual tunnel interfaces (VTI) were introduced in Linux 3.6 (for IPv4) and Linux 3.12 (for IPv6). Appropriate namespace support was added in 3.15. KLIPS, an alternative out-of-tree stack available since Linux 2.2, also features tunnel interfaces.
  4. The mark is set right before doing a policy lookup and restored after that. Consequently, it doesn t affect other possible uses (filtering, routing). However, as Netfilter can also set a mark, one should be careful for conflicts.
  5. The ciphers used here are the strongest ones currently possible while keeping compatibility with JunOS. The documentation for strongSwan contains a complete list of supported algorithms as well as security recommendations to choose them.

20 August 2017

Vincent Bernat: IPv6 route lookup on Linux

TL;DR: With its implementation of IPv6 routing tables using radix trees, Linux offers subpar performance (450 ns for a full view 40,000 routes) compared to IPv4 (50 ns for a full view 500,000 routes) but fair memory usage (20 MiB for a full view).
In a previous article, we had a look at IPv4 route lookup on Linux. Let s see how different IPv6 is.

Lookup trie implementation Looking up a prefix in a routing table comes down to find the most specific entry matching the requested destination. A common structure for this task is the trie, a tree structure where each node has its parent as prefix. With IPv4, Linux uses a level-compressed trie (or LPC-trie), providing good performances with low memory usage. For IPv6, Linux uses a more classic radix tree (or Patricia trie). There are three reasons for not sharing:
  • The IPv6 implementation (introduced in Linux 2.1.8, 1996) predates the IPv4 implementation based on LPC-tries (in Linux 2.6.13, commit 19baf839ff4a).
  • The feature set is different. Notably, IPv6 supports source-specific routing1 (since Linux 2.1.120, 1998).
  • The IPv4 address space is denser than the IPv6 address space. Level-compression is therefore quite efficient with IPv4. This may not be the case with IPv6.
The trie in the below illustration encodes 6 prefixes: Radix tree For more in-depth explanation on the different ways to encode a routing table into a trie and a better understanding of radix trees, see the explanations for IPv4. The following figure shows the in-memory representation of the previous radix tree. Each node corresponds to a struct fib6_node. When a node has the RTN_RTINFO flag set, it embeds a pointer to a struct rt6_info containing information about the next-hop. Memory representation of a routing table The fib6_lookup_1() function walks the radix tree in two steps:
  1. walking down the tree to locate the potential candidate, and
  2. checking the candidate and, if needed, backtracking until a match.
Here is a slightly simplified version without source-specific routing:
static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
                                       struct in6_addr  *addr)
 
    struct fib6_node *fn;
    __be32 dir;
    /* Step 1: locate potential candidate */
    fn = root;
    for (;;)  
        struct fib6_node *next;
        dir = addr_bit_set(addr, fn->fn_bit);
        next = dir ? fn->right : fn->left;
        if (next)  
            fn = next;
            continue;
         
        break;
     
    /* Step 2: check prefix and backtrack if needed */
    while (fn)  
        if (fn->fn_flags & RTN_RTINFO)  
            struct rt6key *key;
            key = fn->leaf->rt6i_dst;
            if (ipv6_prefix_equal(&key->addr, addr, key->plen))  
                if (fn->fn_flags & RTN_RTINFO)
                    return fn;
             
         
        if (fn->fn_flags & RTN_ROOT)
            break;
        fn = fn->parent;
     
    return NULL;
 

Caching While IPv4 lost its route cache in Linux 3.6 (commit 5e9965c15ba8), IPv6 still has a caching mechanism. However cache entries are directly put in the radix tree instead of a distinct structure. Since Linux 2.1.30 (1997) and until Linux 4.2 (commit 45e4fd26683c), almost any successful route lookup inserts a cache entry in the radix tree. For example, a router forwarding a ping between 2001:db8:1::1 and 2001:db8:3::1 would get those two cache entries:
$ ip -6 route show cache
2001:db8:1::1 dev r2-r1  metric 0
    cache
2001:db8:3::1 via 2001:db8:2::2 dev r2-r3  metric 0
    cache
These entries are cleaned up by the ip6_dst_gc() function controlled by the following parameters:
$ sysctl -a   grep -F net.ipv6.route
net.ipv6.route.gc_elasticity = 9
net.ipv6.route.gc_interval = 30
net.ipv6.route.gc_min_interval = 0
net.ipv6.route.gc_min_interval_ms = 500
net.ipv6.route.gc_thresh = 1024
net.ipv6.route.gc_timeout = 60
net.ipv6.route.max_size = 4096
net.ipv6.route.mtu_expires = 600
The garbage collector is triggered at most every 500 ms when there are more than 1024 entries or at least every 30 seconds. The garbage collection won t run for more than 60 seconds, except if there are more than 4096 routes. When running, it will first delete entries older than 30 seconds. If the number of cache entries is still greater than 4096, it will continue to delete more recent entries (but no more recent than 512 jiffies, which is the value of gc_elasticity) after a 500 ms pause. Starting from Linux 4.2 (commit 45e4fd26683c), only a PMTU exception would create a cache entry. A router doesn t have to handle those exceptions, so only hosts would get cache entries. And they should be pretty rare. Martin KaFai Lau explains:
Out of all IPv6 RTF_CACHE routes that are created, the percentage that has a different MTU is very small. In one of our end-user facing proxy server, only 1k out of 80k RTF_CACHE routes have a smaller MTU. For our DC traffic, there is no MTU exception.
Here is how a cache entry with a PMTU exception looks like:
$ ip -6 route show cache
2001:db8:1::50 via 2001:db8:1::13 dev out6  metric 0
    cache  expires 573sec mtu 1400 pref medium

Performance We consider three distinct scenarios:
Excerpt of an Internet full view
In this scenario, Linux acts as an edge router attached to the default-free zone. Currently, the size of such a routing table is a little bit above 40,000 routes.
/48 prefixes spread linearly with different densities
Linux acts as a core router inside a datacenter. Each customer or rack gets one or several /48 networks, which need to be routed around. With a density of 1, /48 subnets are contiguous.
/128 prefixes spread randomly in a fixed /108 subnet
Linux acts as a leaf router for a /64 subnet with hosts getting their IP using autoconfiguration. It is assumed all hosts share the same OUI and therefore, the first 40 bits are fixed. In this scenario, neighbor reachability information for the /64 subnet are converted into routes by some external process and redistributed among other routers sharing the same subnet2.

Route lookup performance With the help of a small kernel module, we can accurately benchmark3 the ip6_route_output_flags() function and correlate the results with the radix tree size: Maximum depth and lookup time Getting meaningful results is challenging due to the size of the address space. None of the scenarios have a fallback route and we only measure time for successful hits4. For the full view scenario, only the range from 2400::/16 to 2a06::/16 is scanned (it contains more than half of the routes). For the /128 scenario, the whole /108 subnet is scanned. For the /48 scenario, the range from the first /48 to the last one is scanned. For each range, 5000 addresses are picked semi-randomly. This operation is repeated until we get 5000 hits or until 1 million tests have been executed. The relation between the maximum depth and the lookup time is incomplete and I can t explain the difference of performance between the different densities of the /48 scenario. We can extract two important performance points:
  • With a full view, the lookup time is 450 ns. This is almost ten times the budget for forwarding at 10 Gbps which is about 50 ns.
  • With an almost empty routing table, the lookup time is 150 ns. This is still over the time budget for forwarding at 10 Gbps.
With IPv4, the lookup time for an almost empty table was 20 ns while the lookup time for a full view (500,000 routes) was a bit above 50 ns. How to explain such a difference? First, the maximum depth of the IPv4 LPC-trie with 500,000 routes was 6, while the maximum depth of the IPv6 radix tree for 40,000 routes is 40. Second, while both IPv4 s fib_lookup() and IPv6 s ip6_route_output_flags() functions have a fixed cost implied by the evaluation of routing rules, IPv4 has several optimizations when the rules are left unmodified5. Those optimizations are removed on the first modification. If we cancel those optimizations, the lookup time for IPv4 is impacted by about 30 ns. This still leaves a 100 ns difference with IPv6 to be explained. Let s compare how time is spent in each lookup function. Here is a CPU flamegraph for IPv4 s fib_lookup(): IPv4 route lookup flamegraph Only 50% of the time is spent in the actual route lookup. The remaining time is spent evaluating the routing rules (about 30 ns). This ratio is dependent on the number of routes we inserted (only 1000 in this example). It should be noted the fib_table_lookup() function is executed twice: once with the local routing table and once with the main routing table. The equivalent flamegraph for IPv6 s ip6_route_output_flags() is depicted below: IPv6 route lookup flamegraph Here is an approximate breakdown on the time spent:
  • 50% is spent in the route lookup in the main table,
  • 15% is spent in handling locking (IPv4 is using the more efficient RCU mechanism),
  • 5% is spent in the route lookup of the local table,
  • most of the remaining is spent in routing rule evaluation (about 100 ns)6.
Why does the evaluation of routing rules is less efficient with IPv6? Again, I don t have a definitive answer.

History The following graph shows the performance progression of route lookups through Linux history: IPv6 route lookup performance progression All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is able to compile older kernels as well as current ones. The kernel configuration is the default one with CONFIG_SMP, CONFIG_IPV6, CONFIG_IPV6_MULTIPLE_TABLES and CONFIG_IPV6_SUBTREES options enabled. Some other unrelated options are enabled to be able to boot them in a virtual machine and run the benchmark. There are three notable performance changes:
  • In Linux 3.1, Eric Dumazet delays a bit the copy of route metrics to fix the undesirable sharing of route-specific metrics by all cache entries (commit 21efcfa0ff27). Each cache entry now gets its own metrics, which explains the performance hit for the non-/128 scenarios.
  • In Linux 3.9, Yoshifuji Hideaki removes the reference to the neighbor entry in struct rt6_info (commit 887c95cc1da5). This should have lead to a performance increase. The small regression may be due to cache-related issues.
  • In Linux 4.2, Martin KaFai Lau prevents the creation of cache entries for most route lookups. The most sensible performance improvement comes with commit 4b32b5ad31a6. The second one is from commit 45e4fd26683c, which effectively removes creation of cache entries, except for PMTU exceptions.

Insertion performance Another interesting performance-related metric is the insertion time. Linux is able to insert a full view in less than two seconds. For some reason, the insertion time is not linear above 50,000 routes and climbs very fast to 60 seconds for 500,000 routes. Insertion time Despite its more complex insertion logic, the IPv4 subsystem is able to insert 2 million routes in less than 10 seconds.

Memory usage Radix tree nodes (struct fib6_node) and routing information (struct rt6_info) are allocated with the slab allocator7. It is therefore possible to extract the information from /proc/slabinfo when the kernel is booted with the slab_nomerge flag:
# sed -ne 2p -e '/^ip6_dst/p' -e '/^fib6_nodes/p' /proc/slabinfo   cut -f1 -d:
   name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
fib6_nodes         76101  76104     64   63    1
ip6_dst_cache      40090  40090    384   10    1
In the above example, the used memory is 76104 64+40090 384 bytes (about 20 MiB). The number of struct rt6_info matches the number of routes while the number of nodes is roughly twice the number of routes: Nodes The memory usage is therefore quite predictable and reasonable, as even a small single-board computer can support several full views (20 MiB for each): Memory usage The LPC-trie used for IPv4 is more efficient: when 512 MiB of memory is needed for IPv6 to store 1 million routes, only 128 MiB are needed for IPv4. The difference is mainly due to the size of struct rt6_info (336 bytes) compared to the size of IPv4 s struct fib_alias (48 bytes): IPv4 puts most information about next-hops in struct fib_info structures that are shared with many entries.

Conclusion The takeaways from this article are:
  • upgrade to Linux 4.2 or more recent to avoid excessive caching,
  • route lookups are noticeably slower compared to IPv4 (by an order of magnitude),
  • CONFIG_IPV6_MULTIPLE_TABLES option incurs a fixed penalty of 100 ns by lookup,
  • memory usage is fair (20 MiB for 40,000 routes).
Compared to IPv4, IPv6 in Linux doesn t foster the same interest, notably in term of optimizations. Hopefully, things are changing as its adoption and use at scale are increasing.

  1. For a given destination prefix, it s possible to attach source-specific prefixes:
    ip -6 route add 2001:db8:1::/64 \
      from 2001:db8:3::/64 \
      via fe80::1 \
      dev eth0
    
    Lookup is first done on the destination address, then on the source address.
  2. This is quite different of the classic scenario where Linux acts as a gateway for a /64 subnet. In this case, the neighbor subsystem stores the reachability information for each host and the routing table only contains a single /64 prefix.
  3. The measurements are done in a virtual machine with one vCPU and no neighbors. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (CPU governor set to performance). The benchmark is single-threaded. Many lookups are performed and the result reported is the median value. Timings of individual runs are computed from the TSC.
  4. Most of the packets in the network are expected to be routed to a destination. However, this also means the backtracking code path is not used in the /128 and /48 scenarios. Having a fallback route gives far different results and make it difficult to ensure we explore the address space correctly.
  5. The exact same optimizations could be applied for IPv6. Nobody did it yet.
  6. Compiling out table support effectively removes those last 100 ns.
  7. There is also per-CPU pointers allocated directly (4 bytes per entry per CPU on a 64-bit architecture). We ignore this detail.

3 July 2017

Vincent Bernat: Performance progression of IPv4 route lookup on Linux

TL;DR: Each of Linux 2.6.39, 3.6 and 4.0 brings notable performance improvements for the IPv4 route lookup process.
In a previous article, I explained how Linux implements an IPv4 routing table with compressed tries to offer excellent lookup times. The following graph shows the performance progression of Linux through history: IPv4 route lookup performance Two scenarios are tested: All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is able to compile older kernels1 as well as current ones. The kernel configuration used is the default one with CONFIG_SMP and CONFIG_IP_MULTIPLE_TABLES options enabled (however, no IP rules are used). Some other unrelated options are enabled to be able to boot them in a virtual machine and run the benchmark. The measurements are done in a virtual machine with one vCPU2. The host is an Intel Core i5-4670K and the CPU governor was set to performance . The benchmark is single-threaded. Implemented as a kernel module, it calls fib_lookup() with various destinations in 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC (and converted to nanoseconds by assuming a constant clock). The following kernel versions bring a notable performance improvement:

  1. Compiling old kernels with an updated userland may still require some small patches.
  2. The kernels are compiled with the CONFIG_SMP option to use the hierarchical RCU and activate more of the same code paths as actual routers. However, progress on parallelism are left unnoticed.

21 June 2017

Vincent Bernat: IPv4 route lookup on Linux

TL;DR: With its implementation of IPv4 routing tables using LPC-tries, Linux offers good lookup performance (50 ns for a full view) and low memory usage (64 MiB for a full view).
During the lifetime of an IPv4 datagram inside the Linux kernel, one important step is the route lookup for the destination address through the fib_lookup() function. From essential information about the datagram (source and destination IP addresses, interfaces, firewall mark, ), this function should quickly provide a decision. Some possible options are: Since 2.6.39, Linux stores routes into a compressed prefix tree (commit 3630b7c050d9). In the past, a route cache was maintained but it has been removed1 in Linux 3.6.

Route lookup in a trie Looking up a route in a routing table is to find the most specific prefix matching the requested destination. Let s assume the following routing table:
$ ip route show scope global table 100
default via 203.0.113.5 dev out2
192.0.2.0/25
        nexthop via 203.0.113.7  dev out3 weight 1
        nexthop via 203.0.113.9  dev out4 weight 1
192.0.2.47 via 203.0.113.3 dev out1
192.0.2.48 via 203.0.113.3 dev out1
192.0.2.49 via 203.0.113.3 dev out1
192.0.2.50 via 203.0.113.3 dev out1
Here are some examples of lookups and the associated results:
Destination IP Next hop
192.0.2.49 203.0.113.3 via out1
192.0.2.50 203.0.113.3 via out1
192.0.2.51 203.0.113.7 via out3 or 203.0.113.9 via out4 (ECMP)
192.0.2.200 203.0.113.5 via out2
A common structure for route lookup is the trie, a tree structure where each node has its parent as prefix.

Lookup with a simple trie The following trie encodes the previous routing table: Simple routing trie For each node, the prefix is known by its path from the root node and the prefix length is the current depth. A lookup in such a trie is quite simple: at each step, fetch the nth bit of the IP address, where n is the current depth. If it is 0, continue with the first child. Otherwise, continue with the second. If a child is missing, backtrack until a routing entry is found. For example, when looking for 192.0.2.50, we will find the result in the corresponding leaf (at depth 32). However for 192.0.2.51, we will reach 192.0.2.50/31 but there is no second child. Therefore, we backtrack until the 192.0.2.0/25 routing entry. Adding and removing routes is quite easy. From a performance point of view, the lookup is done in constant time relative to the number of routes (due to maximum depth being capped to 32). Quagga is an example of routing software still using this simple approach.

Lookup with a path-compressed trie In the previous example, most nodes only have one child. This leads to a lot of unneeded bitwise comparisons and memory is also wasted on many nodes. To overcome this problem, we can use path compression: each node with only one child is removed (except if it also contains a routing entry). Each remaining node gets a new property telling how many input bits should be skipped. Such a trie is also known as a Patricia trie or a radix tree. Here is the path-compressed version of the previous trie: Patricia trie Since some bits have been ignored, on a match, a final check is executed to ensure all bits from the found entry are matching the input IP address. If not, we must act as if the entry wasn t found (and backtrack to find a matching prefix). The following figure shows two IP addresses matching the same leaf: Lookup in a Patricia trie The reduction on the average depth of the tree compensates the necessity to handle those false positives. The insertion and deletion of a routing entry is still easy enough. Many routing systems are using Patricia trees:

Lookup with a level-compressed trie In addition to path compression, level compression2 detects parts of the trie that are densily populated and replace them with a single node and an associated vector of 2k children. This node will handle k input bits instead of just one. For example, here is a level-compressed version our previous trie: Level-compressed trie Such a trie is called LC-trie or LPC-trie and offers higher lookup performances compared to a radix tree. An heuristic is used to decide how many bits a node should handle. On Linux, if the ratio of non-empty children to all children would be above 50% when the node handles an additional bit, the node gets this additional bit. On the other hand, if the current ratio is below 25%, the node loses the responsibility of one bit. Those values are not tunable. Insertion and deletion becomes more complex but lookup times are also improved.

Implementation in Linux The implementation for IPv4 in Linux exists since 2.6.13 (commit 19baf839ff4a) and is enabled by default since 2.6.39 (commit 3630b7c050d9). Here is the representation of our example routing table in memory3: Memory representation of a trie There are several structures involved: The trie can be retrieved through /proc/net/fib_trie:
$ cat /proc/net/fib_trie
Id 100:
  +-- 0.0.0.0/0 2 0 2
      -- 0.0.0.0
        /0 universe UNICAST
     +-- 192.0.2.0/26 2 0 1
         -- 192.0.2.0
           /25 universe UNICAST
         -- 192.0.2.47
           /32 universe UNICAST
        +-- 192.0.2.48/30 2 0 1
            -- 192.0.2.48
              /32 universe UNICAST
            -- 192.0.2.49
              /32 universe UNICAST
            -- 192.0.2.50
              /32 universe UNICAST
[...]
For internal nodes, the numbers after the prefix are:
  1. the number of bits handled by the node,
  2. the number of full children (they only handle one bit),
  3. the number of empty children.
Moreover, if the kernel was compiled with CONFIG_IP_FIB_TRIE_STATS, some interesting statistics are available in /proc/net/fib_triestat4:
$ cat /proc/net/fib_triestat
Basic info: size of leaf: 48 bytes, size of tnode: 40 bytes.
Id 100:
        Aver depth:     2.33
        Max depth:      3
        Leaves:         6
        Prefixes:       6
        Internal nodes: 3
          2: 3
        Pointers: 12
Null ptrs: 4
Total size: 1  kB
[...]
When a routing table is very dense, a node can handle many bits. For example, a densily populated routing table with 1 million entries packed in a /12 can have one internal node handling 20 bits. In this case, route lookup is essentially reduced to a lookup in a vector. The following graph shows the number of internal nodes used relative to the number of routes for different scenarios (routes extracted from an Internet full view, /32 routes spreaded over 4 different subnets with various densities). When routes are densily packed, the number of internal nodes are quite limited. Internal nodes and null pointers

Performance So how performant is a route lookup? The maximum depth stays low (about 6 for a full view), so a lookup should be quite fast. With the help of a small kernel module, we can accurately benchmark5 the fib_lookup() function: Maximum depth and lookup time The lookup time is loosely tied to the maximum depth. When the routing table is densily populated, the maximum depth is low and the lookup times are fast. When forwarding at 10 Gbps, the time budget for a packet would be about 50 ns. Since this is also the time needed for the route lookup alone in some cases, we wouldn t be able to forward at line rate with only one core. Nonetheless, the results are pretty good and they are expected to scale linearly with the number of cores. The measurements are done with a Linux kernel 4.11 from Debian unstable. I have gathered performance metrics accross kernel versions in Performance progression of IPv4 route lookup on Linux . Another interesting figure is the time it takes to insert all those routes into the kernel. Linux is also quite efficient in this area since you can insert 2 million routes in less than 10 seconds: Insertion time

Memory usage The memory usage is available directly in /proc/net/fib_triestat. The statistic provided doesn t account for the fib_info structures, but you should only have a handful of them (one for each possible next-hop). As you can see on the graph below, the memory use is linear with the number of routes inserted, whatever the shape of the routes is. Memory usage The results are quite good. With only 256 MiB, about 2 million routes can be stored!

Routing rules Unless configured without CONFIG_IP_MULTIPLE_TABLES, Linux supports several routing tables and has a system of configurable rules to select the table to use. These rules can be configured with ip rule. By default, there are three of them:
$ ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
Linux will first lookup for a match in the local table. If it doesn t find one, it will lookup in the main table and at last resort, the default table.

Builtin tables The local table contains routes for local delivery:
$ ip route show table local
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 192.168.117.0 dev eno1 proto kernel scope link src 192.168.117.55
local 192.168.117.55 dev eno1 proto kernel scope host src 192.168.117.55
broadcast 192.168.117.63 dev eno1 proto kernel scope link src 192.168.117.55
This table is populated automatically by the kernel when addresses are configured. Let s look at the three last lines. When the IP address 192.168.117.55 was configured on the eno1 interface, the kernel automatically added the appropriate routes:
  • a route for 192.168.117.55 for local unicast delivery to the IP address,
  • a route for 192.168.117.255 for broadcast delivery to the broadcast address,
  • a route for 192.168.117.0 for broadcast delivery to the network address.
When 127.0.0.1 was configured on the loopback interface, the same kind of routes were added to the local table. However, a loopback address receives a special treatment and the kernel also adds the whole subnet to the local table. As a result, you can ping any IP in 127.0.0.0/8:
$ ping -c1 127.42.42.42
PING 127.42.42.42 (127.42.42.42) 56(84) bytes of data.
64 bytes from 127.42.42.42: icmp_seq=1 ttl=64 time=0.039 ms
--- 127.42.42.42 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
The main table usually contains all the other routes:
$ ip route show table main
default via 192.168.117.1 dev eno1 proto static metric 100
192.168.117.0/26 dev eno1 proto kernel scope link src 192.168.117.55 metric 100
The default route has been configured by some DHCP daemon. The connected route (scope link) has been automatically added by the kernel (proto kernel) when configuring an IP address on the eno1 interface. The default table is empty and has little use. It has been kept when the current incarnation of advanced routing has been introduced in Linux 2.1.68 after a first tentative using classes in Linux 2.1.156.

Performance Since Linux 4.1 (commit 0ddcf43d5d4a), when the set of rules is left unmodified, the main and local tables are merged and the lookup is done with this single table (and the default table if not empty). Moreover, since Linux 3.0 (commit f4530fa574df), without specific rules, there is no performance hit when enabling the support for multiple routing tables. However, as soon as you add new rules, some CPU cycles will be spent for each datagram to evaluate them. Here is a couple of graphs demonstrating the impact of routing rules on lookup times: Routing rules impact on performance For some reason, the relation is linear when the number of rules is between 1 and 100 but the slope increases noticeably past this threshold. The second graph highlights the negative impact of the first rule (about 30 ns). A common use of rules is to create virtual routers: interfaces are segregated into domains and when a datagram enters through an interface from domain A, it should use routing table A:
# ip rule add iif vlan457 table 10
# ip rule add iif vlan457 blackhole
# ip rule add iif vlan458 table 20
# ip rule add iif vlan458 blackhole
The blackhole rules may be removed if you are sure there is a default route in each routing table. For example, we add a blackhole default with a high metric to not override a regular default route:
# ip route add blackhole default metric 9999 table 10
# ip route add blackhole default metric 9999 table 20
# ip rule add iif vlan457 table 10
# ip rule add iif vlan458 table 20
To reduce the impact on performance when many interface-specific rules are used, interfaces can be attached to VRF instances and a single rule can be used to select the appropriate table:
# ip link add vrf-A type vrf table 10
# ip link set dev vrf-A up
# ip link add vrf-B type vrf table 20
# ip link set dev vrf-B up
# ip link set dev vlan457 master vrf-A
# ip link set dev vlan458 master vrf-B
# ip rule show
0:      from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default
The special l3mdev-table rule was automatically added when configuring the first VRF interface. This rule will select the routing table associated to the VRF owning the input (or output) interface. VRF was introduced in Linux 4.3 (commit 193125dbd8eb), the performance was greatly enhanced in Linux 4.8 (commit 7889681f4a6c) and the special routing rule was also introduced in Linux 4.8 (commit 96c63fa7393d, commit 1aa6c4f6b8cd). You can find more details about it in the kernel documentation.

Conclusion The takeaways from this article are:
  • route lookup times hardly increase with the number of routes,
  • densily packed /32 routes lead to amazingly fast route lookups,
  • memory use is low (128 MiB par million routes),
  • no optimization is done on routing rules.

  1. The routing cache was subject to reasonably easy to launch denial of service attacks. It was also believed to not be efficient for high volume sites like Google but I have first-hand experience it was not the case for moderately high volume sites.
  2. IP-address lookup using LC-tries , IEEE Journal on Selected Areas in Communications, 17(6):1083-1092, June 1999.
  3. For internal nodes, the key_vector structure is embedded into a tnode structure. This structure contains information rarely used during lookup, notably the reference to the parent that is usually not needed for backtracking as Linux keeps the nearest candidate in a variable.
  4. One leaf can contain several routes (struct fib_alias is a list). The number of prefixes can therefore be greater than the number of leaves. The system also keeps statistics about the distribution of the internal nodes relative to the number of bits they handle. In our example, all the three internal nodes are handling 2 bits.
  5. The measurements are done in a virtual machine with one vCPU. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (CPU governor was set to performance). The benchmark is single-threaded. It runs a warm-up phase, then executes about 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC.
  6. Fun fact: the documentation of this first tentative of more flexible routing is still available in today s kernel tree and explains the usage of the default class .

Next.

Previous.